[jira] [Resolved] (SPARK-37049) executorIdleTimeout is not working for pending pods on K8s

2021-10-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37049.
---
Fix Version/s: 3.1.3
   3.2.1
   3.3.0
   Resolution: Fixed

Issue resolved by pull request 34319
[https://github.com/apache/spark/pull/34319]

> executorIdleTimeout is not working for pending pods on K8s
> --
>
> Key: SPARK-37049
> URL: https://issues.apache.org/jira/browse/SPARK-37049
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Assignee: wwei
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
>
> SPARK-33099 added support for respecting 
> "spark.dynamicAllocation.executorIdleTimeout" in ExecutorPodsAllocator. 
> However, when it checks whether a pending executor pod has timed out, it checks 
> against the pod's "startTime". A pending pod's "startTime" is empty, which 
> causes the function "isExecutorIdleTimedOut()" to always return true for pending 
> pods.
> As a result, pending pods are deleted immediately when a stage 
> finishes, and several new pods get recreated in the next stage. 
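For context, the check in question boils down to something like the sketch below (hypothetical PodInfo structure and field names, not Spark's actual ExecutorPodsAllocator code): when startTime is unset, falling back to the pod's creation time keeps a freshly created pending pod from looking idle.

{code:python}
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass
class PodInfo:
    # start_time stays unset (None) while the pod is still Pending
    start_time: Optional[datetime]
    creation_time: datetime

def is_executor_idle_timed_out(pod: PodInfo, idle_timeout: timedelta) -> bool:
    """Fall back to the creation time when start_time is empty, so a Pending
    pod is not treated as if it had been idle forever."""
    reference = pod.start_time or pod.creation_time
    return datetime.now(timezone.utc) - reference > idle_timeout

# A pod created 5 seconds ago and still Pending is not reported as timed out.
pending = PodInfo(start_time=None,
                  creation_time=datetime.now(timezone.utc) - timedelta(seconds=5))
print(is_executor_idle_timed_out(pending, timedelta(seconds=60)))  # False
{code}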






[jira] [Assigned] (SPARK-37049) executorIdleTimeout is not working for pending pods on K8s

2021-10-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37049:
-

Assignee: wwei

> executorIdleTimeout is not working for pending pods on K8s
> --
>
> Key: SPARK-37049
> URL: https://issues.apache.org/jira/browse/SPARK-37049
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Weiwei Yang
>Assignee: wwei
>Priority: Major
>
> SPARK-33099 added support for respecting 
> "spark.dynamicAllocation.executorIdleTimeout" in ExecutorPodsAllocator. 
> However, when it checks whether a pending executor pod has timed out, it checks 
> against the pod's "startTime". A pending pod's "startTime" is empty, which 
> causes the function "isExecutorIdleTimedOut()" to always return true for pending 
> pods.
> As a result, pending pods are deleted immediately when a stage 
> finishes, and several new pods get recreated in the next stage. 






[jira] [Commented] (SPARK-27696) kubernetes driver pod not deleted after finish.

2021-10-19 Thread Jin Xing (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430971#comment-17430971
 ] 

Jin Xing commented on SPARK-27696:
--

We also suffer from this issue. On our side, we hope Spark can support a way 
for the driver pod to clean itself up.

[~Andrew HUALI] Would you please share the PR or give some insights about 
spark.kubernetes.submission.deleteCompletedPod?

> kubernetes driver pod not deleted after finish.
> ---
>
> Key: SPARK-27696
> URL: https://issues.apache.org/jira/browse/SPARK-27696
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.0
>Reporter: Henry Yu
>Priority: Minor
>
> When submitting to K8s, the driver pod is not deleted after job completion. 
> Since K8s requires that the driver pod name not already exist, this is 
> especially painful when we use a workflow tool to resubmit a failed Spark job. 
> (By the way, the client always exiting with 0 is another painful issue.)
> I have fixed this with a new config, 
> spark.kubernetes.submission.deleteCompletedPod=true, in our internally maintained 
> Spark version. 
>  Do you have more insights, or may I make a PR for this issue?
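For illustration only, a cleanup along the lines of the proposed spark.kubernetes.submission.deleteCompletedPod flag could look like the sketch below, using the official kubernetes Python client; the function name and flow are assumptions, not Spark's implementation.

{code:python}
from kubernetes import client, config

def delete_driver_pod_if_completed(name: str, namespace: str = "default") -> bool:
    """Delete the driver pod once it has reached a terminal phase."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    pod = v1.read_namespaced_pod(name=name, namespace=namespace)
    if pod.status.phase in ("Succeeded", "Failed"):
        v1.delete_namespaced_pod(name=name, namespace=namespace)
        return True
    return False

# Example: clean up after a finished job so the same driver pod name can be reused.
# delete_driver_pod_if_completed("my-app-driver", namespace="spark-jobs")
{code}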






[jira] [Resolved] (SPARK-36348) unexpected Index loaded: pd.Index([10, 20, None], name="x")

2021-10-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36348.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/34335

> unexpected Index loaded: pd.Index([10, 20, None], name="x")
> ---
>
> Key: SPARK-36348
> URL: https://issues.apache.org/jira/browse/SPARK-36348
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.3.0
>
>
> {code:python}
> pidx = pd.Index([10, 20, 15, 30, 45, None], name="x")
> psidx = ps.Index(pidx)
> self.assert_eq(psidx.astype(str), pidx.astype(str))
> {code}
> [left pandas on spark]:  Index(['10.0', '20.0', '15.0', '30.0', '45.0', 
> 'nan'], dtype='object', name='x')
> [right pandas]: Index(['10', '20', '15', '30', '45', 'None'], dtype='object', 
> name='x')
> The index is loaded as float64, so a following step like astype differs 
> from pandas.
> [1] 
> https://github.com/apache/spark/blob/bcc595c112a23d8e3024ace50f0dbc7eab7144b2/python/pyspark/pandas/tests/indexes/test_base.py#L2249






[jira] [Updated] (SPARK-37068) Confusing tgz filename for download

2021-10-19 Thread James Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Yu updated SPARK-37068:
-
Attachment: spark-download-issue.png

> Confusing tgz filename for download
> ---
>
> Key: SPARK-37068
> URL: https://issues.apache.org/jira/browse/SPARK-37068
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation
>Affects Versions: 3.2.0
>Reporter: James Yu
>Priority: Minor
> Attachments: spark-download-issue.png
>
>
> On the Spark download page [https://spark.apache.org/downloads.html], the 
> package type dropdown says "Hadoop 3.3", but the downloaded Spark tgz filename 
> contains "hadoop3.2". This is confusing; which version is correct?
>  
> Download Apache Spark(TM)
>  # Choose a Spark release: 3.2.0 (Oct 13 2021)
>  # Choose a package type: Pre-built for Apache Hadoop 3.3 and later
>  # Download Spark: spark-3.2.0-bin-hadoop3.2.tgz
>  






[jira] [Created] (SPARK-37068) Confusing tgz filename for download

2021-10-19 Thread James Yu (Jira)
James Yu created SPARK-37068:


 Summary: Confusing tgz filename for download
 Key: SPARK-37068
 URL: https://issues.apache.org/jira/browse/SPARK-37068
 Project: Spark
  Issue Type: Bug
  Components: Build, Documentation
Affects Versions: 3.2.0
Reporter: James Yu


On the Spark download page [https://spark.apache.org/downloads.html], the 
package type dropdown says "Hadoop 3.3", but the downloaded Spark tgz filename 
contains "hadoop3.2". This is confusing; which version is correct?

 

Download Apache Spark(TM)
 # Choose a Spark release: 3.2.0 (Oct 13 2021)
 # Choose a package type: Pre-built for Apache Hadoop 3.3 and later
 # Download Spark: spark-3.2.0-bin-hadoop3.2.tgz

 






[jira] [Commented] (SPARK-37067) DateTimeUtils.stringToTimestamp() incorrectly rejects timezone without colon

2021-10-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430872#comment-17430872
 ] 

Apache Spark commented on SPARK-37067:
--

User 'linhongliu-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/34338

> DateTimeUtils.stringToTimestamp() incorrectly rejects timezone without colon
> 
>
> Key: SPARK-37067
> URL: https://issues.apache.org/jira/browse/SPARK-37067
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Linhong Liu
>Priority: Major
>
> A zone ID with a format like "+" or "+0730" can be parsed by 
> `ZoneId.of()` but is rejected by Spark's 
> `DateTimeUtils.stringToTimestamp()`. This means we return null for some 
> valid datetime strings, such as `2021-10-11T03:58:03.000+0700`.






[jira] [Commented] (SPARK-37067) DateTimeUtils.stringToTimestamp() incorrectly rejects timezone without colon

2021-10-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430871#comment-17430871
 ] 

Apache Spark commented on SPARK-37067:
--

User 'linhongliu-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/34338

> DateTimeUtils.stringToTimestamp() incorrectly rejects timezone without colon
> 
>
> Key: SPARK-37067
> URL: https://issues.apache.org/jira/browse/SPARK-37067
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Linhong Liu
>Priority: Major
>
> A zone ID with a format like "+" or "+0730" can be parsed by 
> `ZoneId.of()` but is rejected by Spark's 
> `DateTimeUtils.stringToTimestamp()`. This means we return null for some 
> valid datetime strings, such as `2021-10-11T03:58:03.000+0700`.






[jira] [Assigned] (SPARK-37067) DateTimeUtils.stringToTimestamp() incorrectly rejects timezone without colon

2021-10-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37067:


Assignee: (was: Apache Spark)

> DateTimeUtils.stringToTimestamp() incorrectly rejects timezone without colon
> 
>
> Key: SPARK-37067
> URL: https://issues.apache.org/jira/browse/SPARK-37067
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Linhong Liu
>Priority: Major
>
> A zone ID with a format like "+" or "+0730" can be parsed by 
> `ZoneId.of()` but is rejected by Spark's 
> `DateTimeUtils.stringToTimestamp()`. This means we return null for some 
> valid datetime strings, such as `2021-10-11T03:58:03.000+0700`.






[jira] [Assigned] (SPARK-37067) DateTimeUtils.stringToTimestamp() incorrectly rejects timezone without colon

2021-10-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37067:


Assignee: Apache Spark

> DateTimeUtils.stringToTimestamp() incorrectly rejects timezone without colon
> 
>
> Key: SPARK-37067
> URL: https://issues.apache.org/jira/browse/SPARK-37067
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Linhong Liu
>Assignee: Apache Spark
>Priority: Major
>
> A zone ID with a format like "+" or "+0730" can be parsed by 
> `ZoneId.of()` but is rejected by Spark's 
> `DateTimeUtils.stringToTimestamp()`. This means we return null for some 
> valid datetime strings, such as `2021-10-11T03:58:03.000+0700`.






[jira] [Created] (SPARK-37067) DateTimeUtils.stringToTimestamp() incorrectly rejects timezone without colon

2021-10-19 Thread Linhong Liu (Jira)
Linhong Liu created SPARK-37067:
---

 Summary: DateTimeUtils.stringToTimestamp() incorrectly rejects 
timezone without colon
 Key: SPARK-37067
 URL: https://issues.apache.org/jira/browse/SPARK-37067
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.2.0, 3.1.0
Reporter: Linhong Liu


A zone ID with a format like "+" or "+0730" can be parsed by 
`ZoneId.of()` but is rejected by Spark's `DateTimeUtils.stringToTimestamp()`. 
This means we return null for some valid datetime strings, such as 
`2021-10-11T03:58:03.000+0700`.
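As a quick sanity check of the claim that such strings are valid (plain Python here, not Spark's internal parser), the no-colon offset form is accepted by standard datetime parsing:

{code:python}
from datetime import datetime

# "+0700"-style offsets (no colon) are well-formed; %z accepts them directly.
ts = datetime.strptime("2021-10-11T03:58:03.000+0700", "%Y-%m-%dT%H:%M:%S.%f%z")
print(ts.isoformat())   # 2021-10-11T03:58:03+07:00
print(ts.utcoffset())   # 7:00:00
{code}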






[jira] [Assigned] (SPARK-36348) unexpected Index loaded: pd.Index([10, 20, None], name="x")

2021-10-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36348:


Assignee: Yikun Jiang

> unexpected Index loaded: pd.Index([10, 20, None], name="x")
> ---
>
> Key: SPARK-36348
> URL: https://issues.apache.org/jira/browse/SPARK-36348
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>
> {code:python}
> pidx = pd.Index([10, 20, 15, 30, 45, None], name="x")
> psidx = ps.Index(pidx)
> self.assert_eq(psidx.astype(str), pidx.astype(str))
> {code}
> [left pandas on spark]:  Index(['10.0', '20.0', '15.0', '30.0', '45.0', 
> 'nan'], dtype='object', name='x')
> [right pandas]: Index(['10', '20', '15', '30', '45', 'None'], dtype='object', 
> name='x')
> The index is loaded as float64, so a following step like astype differs 
> from pandas.
> [1] 
> https://github.com/apache/spark/blob/bcc595c112a23d8e3024ace50f0dbc7eab7144b2/python/pyspark/pandas/tests/indexes/test_base.py#L2249






[jira] [Commented] (SPARK-37066) Improve ORC RecordReader's error message

2021-10-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430858#comment-17430858
 ] 

Apache Spark commented on SPARK-37066:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34337

> Improve ORC RecordReader's error message
> 
>
> Key: SPARK-37066
> URL: https://issues.apache.org/jira/browse/SPARK-37066
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> In the error message below, we don't know the actual file path.
> {code}
> 21/10/19 18:12:58 ERROR Executor: Exception in task 34.1 in stage 14.0 (TID 
> 257)
> java.lang.ArrayIndexOutOfBoundsException: 1024
>   at 
> org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:292)
>   at 
> org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.nextVector(TreeReaderFactory.java:1820)
>   at 
> org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1517)
>   at 
> org.apache.orc.impl.ConvertTreeReaderFactory$DateFromStringGroupTreeReader.nextVector(ConvertTreeReaderFactory.java:1802)
>   at 
> org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:2059)
>   at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1324)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:196)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:99)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}






[jira] [Assigned] (SPARK-37066) Improve ORC RecordReader's error message

2021-10-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37066:


Assignee: Apache Spark

> Improve ORC RecordReader's error message
> 
>
> Key: SPARK-37066
> URL: https://issues.apache.org/jira/browse/SPARK-37066
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> In the error message below, we don't know the actual file path.
> {code}
> 21/10/19 18:12:58 ERROR Executor: Exception in task 34.1 in stage 14.0 (TID 
> 257)
> java.lang.ArrayIndexOutOfBoundsException: 1024
>   at 
> org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:292)
>   at 
> org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.nextVector(TreeReaderFactory.java:1820)
>   at 
> org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1517)
>   at 
> org.apache.orc.impl.ConvertTreeReaderFactory$DateFromStringGroupTreeReader.nextVector(ConvertTreeReaderFactory.java:1802)
>   at 
> org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:2059)
>   at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1324)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:196)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:99)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}






[jira] [Commented] (SPARK-37066) Improve ORC RecordReader's error message

2021-10-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430857#comment-17430857
 ] 

Apache Spark commented on SPARK-37066:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34337

> Improve ORC RecordReader's error message
> 
>
> Key: SPARK-37066
> URL: https://issues.apache.org/jira/browse/SPARK-37066
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> In the error message below, we don't know the actual file path.
> {code}
> 21/10/19 18:12:58 ERROR Executor: Exception in task 34.1 in stage 14.0 (TID 
> 257)
> java.lang.ArrayIndexOutOfBoundsException: 1024
>   at 
> org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:292)
>   at 
> org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.nextVector(TreeReaderFactory.java:1820)
>   at 
> org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1517)
>   at 
> org.apache.orc.impl.ConvertTreeReaderFactory$DateFromStringGroupTreeReader.nextVector(ConvertTreeReaderFactory.java:1802)
>   at 
> org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:2059)
>   at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1324)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:196)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:99)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}






[jira] [Assigned] (SPARK-37066) Improve ORC RecordReader's error message

2021-10-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37066:


Assignee: (was: Apache Spark)

> Improve ORC RecordReader's error message
> 
>
> Key: SPARK-37066
> URL: https://issues.apache.org/jira/browse/SPARK-37066
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> In the error message below, we don't know the actual file path.
> {code}
> 21/10/19 18:12:58 ERROR Executor: Exception in task 34.1 in stage 14.0 (TID 
> 257)
> java.lang.ArrayIndexOutOfBoundsException: 1024
>   at 
> org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:292)
>   at 
> org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.nextVector(TreeReaderFactory.java:1820)
>   at 
> org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1517)
>   at 
> org.apache.orc.impl.ConvertTreeReaderFactory$DateFromStringGroupTreeReader.nextVector(ConvertTreeReaderFactory.java:1802)
>   at 
> org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:2059)
>   at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1324)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:196)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:99)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}






[jira] [Created] (SPARK-37066) Improve ORC RecordReader's error message

2021-10-19 Thread angerszhu (Jira)
angerszhu created SPARK-37066:
-

 Summary: Improve ORC RecordReader's error message
 Key: SPARK-37066
 URL: https://issues.apache.org/jira/browse/SPARK-37066
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu


In the error message below, we don't know the actual file path.
{code}

21/10/19 18:12:58 ERROR Executor: Exception in task 34.1 in stage 14.0 (TID 257)
java.lang.ArrayIndexOutOfBoundsException: 1024
at 
org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:292)
at 
org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.nextVector(TreeReaderFactory.java:1820)
at 
org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1517)
at 
org.apache.orc.impl.ConvertTreeReaderFactory$DateFromStringGroupTreeReader.nextVector(ConvertTreeReaderFactory.java:1802)
at 
org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:2059)
at 
org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1324)
at 
org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:196)
at 
org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:99)
at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}
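A minimal sketch of the kind of improvement being requested, with made-up names (file_path, next_batch) rather than Spark's actual reader code: wrap the per-file read so any failure reports which file was being read.

{code:python}
def read_all_batches(file_path, next_batch):
    """`next_batch` is any callable that reads one batch and may raise."""
    try:
        while next_batch():
            pass
    except Exception as err:
        # Surface the failing file path alongside the original error.
        raise RuntimeError(f"Error while reading ORC file: {file_path}") from err
{code}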






[jira] [Updated] (SPARK-37035) Improve error message when use vectorize reader

2021-10-19 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-37035:
--
Parent: SPARK-37065
Issue Type: Sub-task  (was: Improvement)

> Improve error message when use vectorize reader
> ---
>
> Key: SPARK-37035
> URL: https://issues.apache.org/jira/browse/SPARK-37035
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> The vectorized reader doesn't show which file failed to read.
>  
> Non-vectorized Parquet reader:
> {code}
> cutionException: Encounter error while reading parquet files. One possible 
> cause: Parquet column cannot be converted in the corresponding files. Details:
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:193)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value 
> at 1 in block 0 in file hdfs://path/to/failed/file
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
>   ... 15 more
> {code}
> Vectorized Parquet reader:
> {code}
> 21/10/15 18:01:54 WARN TaskSetManager: Lost task 1881.0 in stage 16.0 (TID 
> 10380, ip-10-130-169-140.idata-server.shopee.io, executor 168): TaskKilled 
> (Stage cancelled)
> : An error occurred while calling o362.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 963 
> in stage 17.0 failed 4 times, most recent failure: Lost task 963.3 in stage 
> 17.0 (TID 10351, ip-10-130-75-201.idata-server.shopee.io, executor 99): 
> java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
>   at org.apache.parquet.column.Dictionary.decodeToLong(Dictionary.java:49)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToLong(ParquetDictionary.java:36)
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:364)
>   at 
> org.apache.spark.sql.execution.vectorized.MutableColumnarRow.getLong(MutableColumnarRow.java:120)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anonfun$doExecute$2$$anonfun$apply$2.apply(DataSourceScanExec.scala:351)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anonfun$doExecute$2$$anonfun$apply$2.apply(DataSourceScanExec.scala:349)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at 
> 

[jira] [Created] (SPARK-37065) improve File reader's error message for each data source

2021-10-19 Thread angerszhu (Jira)
angerszhu created SPARK-37065:
-

 Summary: improve File reader's error message for each data source
 Key: SPARK-37065
 URL: https://issues.apache.org/jira/browse/SPARK-37065
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu


improve File reader's error message for each data source






[jira] [Resolved] (SPARK-37044) Add Row to __all__ in pyspark.sql.types

2021-10-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37044.
--
Fix Version/s: 3.3.0
 Assignee: Maciej Szymkiewicz
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/34332

> Add Row to __all__ in pyspark.sql.types
> ---
>
> Key: SPARK-37044
> URL: https://issues.apache.org/jira/browse/SPARK-37044
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.0, 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 3.3.0
>
>
> Currently {{Row}}, defined in {{pyspark.sql.types}}, is exported from 
> {{pyspark.sql}} but not from {{pyspark.sql.types}} itself. This means that 
> {{from pyspark.sql.types import *}} won't import {{Row}}.
> This can be counter-intuitive, especially since we import {{Row}} from 
> {{types}} in {{examples}}.
> Should we add it to {{__all__}}?
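A quick way to check the reported behavior (the second result reflects this report for the affected versions):

{code:python}
from pyspark.sql import types

ns = {}
exec("from pyspark.sql.types import *", ns)
print(hasattr(types, "Row"))  # True: the class is defined in this module
print("Row" in ns)            # False on the affected versions, per this report
{code}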






[jira] [Assigned] (SPARK-36763) Pull out ordering expressions

2021-10-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36763:


Assignee: Apache Spark

> Pull out ordering expressions
> -
>
> Key: SPARK-36763
> URL: https://issues.apache.org/jira/browse/SPARK-36763
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> Similarly to 
> [PullOutGroupingExpressions|https://github.com/apache/spark/blob/7fd3f8f9ec55b364525407213ba1c631705686c5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PullOutGroupingExpressions.scala#L48], 
> we can pull out ordering expressions to improve sort performance. For 
> example:
> {code:scala}
> sql("create table t1(a int, b int) using parquet")
> sql("insert into t1 values (1, 2)")
> sql("insert into t1 values (3, 4)")
> sql("select * from t1 order by a - b").explain
> {code}
> {noformat}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Sort [(a#12 - b#13) ASC NULLS FIRST], true, 0
>+- Exchange rangepartitioning((a#12 - b#13) ASC NULLS FIRST, 5), 
> ENSURE_REQUIREMENTS, [id=#39]
>   +- FileScan parquet default.t1[a#12,b#13]
> {noformat}
> The {{Subtract}} will be evaluated 4 times.
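For illustration, the same rewrite done by hand in PySpark (the _sortorder column name is made up; the proposed rule would perform this inside the optimizer): compute the ordering expression once, sort on the precomputed column, then drop it.

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[2]").getOrCreate()
df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])

# Project a - b once, sort on the precomputed column, then drop it,
# instead of re-evaluating the expression during the sort.
result = (df
          .withColumn("_sortorder", F.col("a") - F.col("b"))
          .orderBy("_sortorder")
          .drop("_sortorder"))
result.show()
{code}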






[jira] [Assigned] (SPARK-36763) Pull out ordering expressions

2021-10-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36763:


Assignee: (was: Apache Spark)

> Pull out ordering expressions
> -
>
> Key: SPARK-36763
> URL: https://issues.apache.org/jira/browse/SPARK-36763
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
> Similarly to 
> [PullOutGroupingExpressions|https://github.com/apache/spark/blob/7fd3f8f9ec55b364525407213ba1c631705686c5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PullOutGroupingExpressions.scala#L48], 
> we can pull out ordering expressions to improve sort performance. For 
> example:
> {code:scala}
> sql("create table t1(a int, b int) using parquet")
> sql("insert into t1 values (1, 2)")
> sql("insert into t1 values (3, 4)")
> sql("select * from t1 order by a - b").explain
> {code}
> {noformat}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Sort [(a#12 - b#13) ASC NULLS FIRST], true, 0
>+- Exchange rangepartitioning((a#12 - b#13) ASC NULLS FIRST, 5), 
> ENSURE_REQUIREMENTS, [id=#39]
>   +- FileScan parquet default.t1[a#12,b#13]
> {noformat}
> The {{Subtract}} will be evaluated 4 times.






[jira] [Assigned] (SPARK-37064) Fix outer join return the wrong max rows if other side is empty

2021-10-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37064:


Assignee: Apache Spark

> Fix outer join return the wrong max rows if other side is empty
> ---
>
> Key: SPARK-37064
> URL: https://issues.apache.org/jira/browse/SPARK-37064
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Major
>
> An outer join should return at least the number of rows of its outer side: a 
> left outer join at least its left side, a right outer join at least its right 
> side, and a full outer join at least both sides.






[jira] [Commented] (SPARK-37064) Fix outer join return the wrong max rows if other side is empty

2021-10-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430846#comment-17430846
 ] 

Apache Spark commented on SPARK-37064:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/34336

> Fix outer join return the wrong max rows if other side is empty
> ---
>
> Key: SPARK-37064
> URL: https://issues.apache.org/jira/browse/SPARK-37064
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> An outer join should return at least the number of rows of its outer side: a 
> left outer join at least its left side, a right outer join at least its right 
> side, and a full outer join at least both sides.






[jira] [Commented] (SPARK-36763) Pull out ordering expressions

2021-10-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430847#comment-17430847
 ] 

Apache Spark commented on SPARK-36763:
--

User 'RabbidHY' has created a pull request for this issue:
https://github.com/apache/spark/pull/34334

> Pull out ordering expressions
> -
>
> Key: SPARK-36763
> URL: https://issues.apache.org/jira/browse/SPARK-36763
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
> Similarly to 
> [PullOutGroupingExpressions|https://github.com/apache/spark/blob/7fd3f8f9ec55b364525407213ba1c631705686c5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PullOutGroupingExpressions.scala#L48], 
> we can pull out ordering expressions to improve sort performance. For 
> example:
> {code:scala}
> sql("create table t1(a int, b int) using parquet")
> sql("insert into t1 values (1, 2)")
> sql("insert into t1 values (3, 4)")
> sql("select * from t1 order by a - b").explain
> {code}
> {noformat}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Sort [(a#12 - b#13) ASC NULLS FIRST], true, 0
>+- Exchange rangepartitioning((a#12 - b#13) ASC NULLS FIRST, 5), 
> ENSURE_REQUIREMENTS, [id=#39]
>   +- FileScan parquet default.t1[a#12,b#13]
> {noformat}
> The {{Subtract}} will be evaluated 4 times.






[jira] [Assigned] (SPARK-36348) unexpected Index loaded: pd.Index([10, 20, None], name="x")

2021-10-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36348:


Assignee: (was: Apache Spark)

> unexpected Index loaded: pd.Index([10, 20, None], name="x")
> ---
>
> Key: SPARK-36348
> URL: https://issues.apache.org/jira/browse/SPARK-36348
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Priority: Major
>
> {code:python}
> pidx = pd.Index([10, 20, 15, 30, 45, None], name="x")
> psidx = ps.Index(pidx)
> self.assert_eq(psidx.astype(str), pidx.astype(str))
> {code}
> [left pandas on spark]:  Index(['10.0', '20.0', '15.0', '30.0', '45.0', 
> 'nan'], dtype='object', name='x')
> [right pandas]: Index(['10', '20', '15', '30', '45', 'None'], dtype='object', 
> name='x')
> The index is loaded as float64, so a following step like astype differs 
> from pandas.
> [1] 
> https://github.com/apache/spark/blob/bcc595c112a23d8e3024ace50f0dbc7eab7144b2/python/pyspark/pandas/tests/indexes/test_base.py#L2249






[jira] [Assigned] (SPARK-37064) Fix outer join return the wrong max rows if other side is empty

2021-10-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37064:


Assignee: (was: Apache Spark)

> Fix outer join return the wrong max rows if other side is empty
> ---
>
> Key: SPARK-37064
> URL: https://issues.apache.org/jira/browse/SPARK-37064
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> An outer join should return at least the number of rows of its outer side: a 
> left outer join at least its left side, a right outer join at least its right 
> side, and a full outer join at least both sides.






[jira] [Assigned] (SPARK-36348) unexpected Index loaded: pd.Index([10, 20, None], name="x")

2021-10-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36348:


Assignee: Apache Spark

> unexpected Index loaded: pd.Index([10, 20, None], name="x")
> ---
>
> Key: SPARK-36348
> URL: https://issues.apache.org/jira/browse/SPARK-36348
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Assignee: Apache Spark
>Priority: Major
>
> {code:python}
> pidx = pd.Index([10, 20, 15, 30, 45, None], name="x")
> psidx = ps.Index(pidx)
> self.assert_eq(psidx.astype(str), pidx.astype(str))
> {code}
> [left pandas on spark]:  Index(['10.0', '20.0', '15.0', '30.0', '45.0', 
> 'nan'], dtype='object', name='x')
> [right pandas]: Index(['10', '20', '15', '30', '45', 'None'], dtype='object', 
> name='x')
> The index is loaded as float64, so a following step like astype differs 
> from pandas.
> [1] 
> https://github.com/apache/spark/blob/bcc595c112a23d8e3024ace50f0dbc7eab7144b2/python/pyspark/pandas/tests/indexes/test_base.py#L2249






[jira] [Commented] (SPARK-36348) unexpected Index loaded: pd.Index([10, 20, None], name="x")

2021-10-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430845#comment-17430845
 ] 

Apache Spark commented on SPARK-36348:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34335

> unexpected Index loaded: pd.Index([10, 20, None], name="x")
> ---
>
> Key: SPARK-36348
> URL: https://issues.apache.org/jira/browse/SPARK-36348
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Priority: Major
>
> {code:python}
> pidx = pd.Index([10, 20, 15, 30, 45, None], name="x")
> psidx = ps.Index(pidx)
> self.assert_eq(psidx.astype(str), pidx.astype(str))
> {code}
> [left pandas on spark]:  Index(['10.0', '20.0', '15.0', '30.0', '45.0', 
> 'nan'], dtype='object', name='x')
> [right pandas]: Index(['10', '20', '15', '30', '45', 'None'], dtype='object', 
> name='x')
> The index is loaded as float64, so a following step like astype differs 
> from pandas.
> [1] 
> https://github.com/apache/spark/blob/bcc595c112a23d8e3024ace50f0dbc7eab7144b2/python/pyspark/pandas/tests/indexes/test_base.py#L2249






[jira] [Created] (SPARK-37064) Fix outer join return the wrong max rows if other side is empty

2021-10-19 Thread XiDuo You (Jira)
XiDuo You created SPARK-37064:
-

 Summary: Fix outer join return the wrong max rows if other side is 
empty
 Key: SPARK-37064
 URL: https://issues.apache.org/jira/browse/SPARK-37064
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0, 3.3.0
Reporter: XiDuo You


An outer join should return at least the number of rows of its outer side: a 
left outer join at least its left side, a right outer join at least its right 
side, and a full outer join at least both sides.
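A small PySpark demonstration of the property described above, assuming a local session:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
left = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "v"])
empty_right = spark.createDataFrame([], "id int, w string")

# Even with an empty other side, the outer side's rows are all preserved,
# so the plan's max-rows estimate must not drop below the outer side's count.
print(left.join(empty_right, on="id", how="left").count())  # 2
{code}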






[jira] [Commented] (SPARK-37063) SQL Adaptive Query Execution QA: Phase 2

2021-10-19 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430838#comment-17430838
 ] 

XiDuo You commented on SPARK-37063:
---

Thank you, [~dongjoon], for creating this umbrella!

> SQL Adaptive Query Execution QA: Phase 2
> 
>
> Key: SPARK-37063
> URL: https://issues.apache.org/jira/browse/SPARK-37063
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Resolved] (SPARK-37002) Introduce the 'compute.eager_check' option

2021-10-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37002.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34281
[https://github.com/apache/spark/pull/34281]

> Introduce the 'compute.eager_check' option
> --
>
> Key: SPARK-37002
> URL: https://issues.apache.org/jira/browse/SPARK-37002
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Assignee: dch nguyen
>Priority: Major
> Fix For: 3.3.0
>
>
> https://issues.apache.org/jira/browse/SPARK-36968
> [https://github.com/apache/spark/pull/34235]
>  






[jira] [Assigned] (SPARK-37002) Introduce the 'compute.eager_check' option

2021-10-19 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37002:


Assignee: dch nguyen

> Introduce the 'compute.eager_check' option
> --
>
> Key: SPARK-37002
> URL: https://issues.apache.org/jira/browse/SPARK-37002
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Assignee: dch nguyen
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-36968
> [https://github.com/apache/spark/pull/34235]
>  






[jira] [Resolved] (SPARK-37059) Ensure the sort order of the output in the PySpark doctests

2021-10-19 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-37059.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved in https://github.com/apache/spark/pull/34330

> Ensure the sort order of the output in the PySpark doctests
> ---
>
> Key: SPARK-37059
> URL: https://issues.apache.org/jira/browse/SPARK-37059
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.3.0
>
>
> The collect_set builtin function doesn't guarantee the sort order of its result 
> for each row, and FPGrowthModel.freqItemsets doesn't guarantee the sort order 
> of its result rows.
> Nevertheless, their PySpark doctests assume a particular sort order, which 
> causes those doctests to fail with Scala 2.13.
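The underlying nondeterminism is easy to see outside the doctests. A small sketch 
(the data and column names are made up, and this is not the change made for this 
issue, which touches the PySpark doctests themselves):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_sort, collect_set}

val spark = SparkSession.builder().master("local[*]").appName("collect-set-order").getOrCreate()
import spark.implicits._

val df = Seq(("a", 3), ("a", 1), ("a", 2)).toDF("k", "v")

// The element order inside the collected array is not guaranteed; it can differ
// between runs, plans, and Scala versions.
df.groupBy($"k").agg(collect_set($"v")).show()

// Sorting the array makes the output stable enough to assert on in a test.
df.groupBy($"k").agg(array_sort(collect_set($"v"))).show()
{code}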



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37041) Backport HIVE-15025: Secure-Socket-Layer (SSL) support for HMS

2021-10-19 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-37041.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34312
[https://github.com/apache/spark/pull/34312]

> Backport HIVE-15025: Secure-Socket-Layer (SSL) support for HMS
> --
>
> Key: SPARK-37041
> URL: https://issues.apache.org/jira/browse/SPARK-37041
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.3.0
>
>
> Backport https://issues.apache.org/jira/browse/HIVE-15025 to make it easier 
> to upgrade Thrift.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37041) Backport HIVE-15025: Secure-Socket-Layer (SSL) support for HMS

2021-10-19 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-37041:
---

Assignee: Yuming Wang

> Backport HIVE-15025: Secure-Socket-Layer (SSL) support for HMS
> --
>
> Key: SPARK-37041
> URL: https://issues.apache.org/jira/browse/SPARK-37041
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> Backport https://issues.apache.org/jira/browse/HIVE-15025 to make it easier 
> to upgrade Thrift.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37043) Cancel all running job after AQE plan finished

2021-10-19 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430797#comment-17430797
 ] 

Erik Krogen commented on SPARK-37043:
-

[~ulysses] any concerns if I make this a sub-task of SPARK-37063 to track it 
alongside other AQE fixes?

> Cancel all running job after AQE plan finished
> --
>
> Key: SPARK-37043
> URL: https://issues.apache.org/jira/browse/SPARK-37043
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> We see a stage still running after the AQE plan has finished. This happens because 
> a plan that contains an empty join has been converted to `LocalTableScanExec` 
> during `AQEOptimizer`, while the other side of that join (a shuffle map stage) 
> is still running.
>  
> There is no point in keeping that stage running; it is better to cancel the running 
> stage once the AQE plan has finished, so that task resources are not wasted.
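For illustration only, Spark's public job-group API shows the kind of cancellation 
primitive involved; this sketch is not the AQE-internal change being proposed, and 
the group name and workload below are made up:

{code:scala}
import org.apache.spark.sql.SparkSession
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

val spark = SparkSession.builder().master("local[*]").appName("cancel-sketch").getOrCreate()
val sc = spark.sparkContext

// Launch some long-running work under a named job group.
val pending = Future {
  sc.setJobGroup("no-longer-needed", "work whose result may become unnecessary",
    interruptOnCancel = true)
  sc.range(0L, 10000000000L).map(_ + 1).count()
}

// Once the result is known to be unnecessary (for AQE, once the plan has finished),
// cancel everything submitted under that group.
Thread.sleep(2000)
sc.cancelJobGroup("no-longer-needed")
{code}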



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33828) SQL Adaptive Query Execution QA

2021-10-19 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430798#comment-17430798
 ] 

Erik Krogen commented on SPARK-33828:
-

Thanks [~dongjoon]!

> SQL Adaptive Query Execution QA
> ---
>
> Key: SPARK-33828
> URL: https://issues.apache.org/jira/browse/SPARK-33828
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Since SPARK-31412 was delivered in 3.0.0, we have received and handled many JIRA 
> issues in 3.0.x/3.1.0/3.2.0. This umbrella JIRA issue aims to enable the feature by 
> default and to collect all related information in order to QA it in the 
> Apache Spark 3.2.0 timeframe.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37062) Introduce a new data source for providing consistent set of rows per microbatch

2021-10-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37062:


Assignee: (was: Apache Spark)

> Introduce a new data source for providing consistent set of rows per 
> microbatch
> ---
>
> Key: SPARK-37062
> URL: https://issues.apache.org/jira/browse/SPARK-37062
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> The "rate" data source has been known to be used as a benchmark for streaming 
> query.
> While this helps to put the query to the limit (how many rows the query could 
> process per second), the rate data source doesn't provide consistent rows per 
> batch into stream, which leads two environments be hard to compare with.
> For example, in many cases, you may want to compare the metrics in the 
> batches between test environments (like running same streaming query with 
> different options). These metrics are strongly affected if the distribution 
> of input rows in batches are changing, especially a micro-batch has been 
> lagged (in any reason) and rate data source produces more input rows to the 
> next batch.
> Also, when you test against streaming aggregation, you may want the data 
> source produces the same set of input rows per batch (deterministic), so that 
> you can plan how these input rows will be aggregated and how state rows will 
> be evicted, and craft the test query based on the plan.
> The requirements of new data source would follow:
> * it should produce a specific number of input rows as requested
> * it should also include a timestamp (event time) into each row
> ** to make the input rows fully deterministic, timestamp should be configured 
> as well (like start timestamp & amount of advance per batch)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37062) Introduce a new data source for providing consistent set of rows per microbatch

2021-10-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37062:


Assignee: Apache Spark

> Introduce a new data source for providing consistent set of rows per 
> microbatch
> ---
>
> Key: SPARK-37062
> URL: https://issues.apache.org/jira/browse/SPARK-37062
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> The "rate" data source has been known to be used as a benchmark for streaming 
> query.
> While this helps to put the query to the limit (how many rows the query could 
> process per second), the rate data source doesn't provide consistent rows per 
> batch into stream, which leads two environments be hard to compare with.
> For example, in many cases, you may want to compare the metrics in the 
> batches between test environments (like running same streaming query with 
> different options). These metrics are strongly affected if the distribution 
> of input rows in batches are changing, especially a micro-batch has been 
> lagged (in any reason) and rate data source produces more input rows to the 
> next batch.
> Also, when you test against streaming aggregation, you may want the data 
> source produces the same set of input rows per batch (deterministic), so that 
> you can plan how these input rows will be aggregated and how state rows will 
> be evicted, and craft the test query based on the plan.
> The requirements of new data source would follow:
> * it should produce a specific number of input rows as requested
> * it should also include a timestamp (event time) into each row
> ** to make the input rows fully deterministic, timestamp should be configured 
> as well (like start timestamp & amount of advance per batch)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37062) Introduce a new data source for providing consistent set of rows per microbatch

2021-10-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430796#comment-17430796
 ] 

Apache Spark commented on SPARK-37062:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/34333

> Introduce a new data source for providing consistent set of rows per 
> microbatch
> ---
>
> Key: SPARK-37062
> URL: https://issues.apache.org/jira/browse/SPARK-37062
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> The "rate" data source has been known to be used as a benchmark for streaming 
> query.
> While this helps to put the query to the limit (how many rows the query could 
> process per second), the rate data source doesn't provide consistent rows per 
> batch into stream, which leads two environments be hard to compare with.
> For example, in many cases, you may want to compare the metrics in the 
> batches between test environments (like running same streaming query with 
> different options). These metrics are strongly affected if the distribution 
> of input rows in batches are changing, especially a micro-batch has been 
> lagged (in any reason) and rate data source produces more input rows to the 
> next batch.
> Also, when you test against streaming aggregation, you may want the data 
> source produces the same set of input rows per batch (deterministic), so that 
> you can plan how these input rows will be aggregated and how state rows will 
> be evicted, and craft the test query based on the plan.
> The requirements of new data source would follow:
> * it should produce a specific number of input rows as requested
> * it should also include a timestamp (event time) into each row
> ** to make the input rows fully deterministic, timestamp should be configured 
> as well (like start timestamp & amount of advance per batch)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33828) SQL Adaptive Query Execution QA

2021-10-19 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430790#comment-17430790
 ] 

Dongjoon Hyun commented on SPARK-33828:
---

To all:
This JIRA contains all efforts delivered in Apache Spark 3.2.0.
SPARK-37063 will track the remaining efforts in the Apache Spark 3.3 timeframe.

> SQL Adaptive Query Execution QA
> ---
>
> Key: SPARK-33828
> URL: https://issues.apache.org/jira/browse/SPARK-33828
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Since SPARK-31412 was delivered in 3.0.0, we have received and handled many JIRA 
> issues in 3.0.x/3.1.0/3.2.0. This umbrella JIRA issue aims to enable the feature by 
> default and to collect all related information in order to QA it in the 
> Apache Spark 3.2.0 timeframe.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33828) SQL Adaptive Query Execution QA

2021-10-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33828.
---
Fix Version/s: 3.2.0
   Resolution: Done

> SQL Adaptive Query Execution QA
> ---
>
> Key: SPARK-33828
> URL: https://issues.apache.org/jira/browse/SPARK-33828
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Since SPARK-31412 was delivered in 3.0.0, we have received and handled many JIRA 
> issues in 3.0.x/3.1.0/3.2.0. This umbrella JIRA issue aims to enable the feature by 
> default and to collect all related information in order to QA it in the 
> Apache Spark 3.2.0 timeframe.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36443) Demote BroadcastJoin causes performance regression and increases OOM risks

2021-10-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36443:
--
Parent: SPARK-37063
Issue Type: Sub-task  (was: Bug)

> Demote BroadcastJoin causes performance regression and increases OOM risks
> --
>
> Key: SPARK-36443
> URL: https://issues.apache.org/jira/browse/SPARK-36443
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Kent Yao
>Priority: Major
> Attachments: image-2021-08-06-11-24-34-122.png, 
> image-2021-08-06-17-57-15-765.png, screenshot-1.png
>
>
>  
> h2. A test case
> Use bin/spark-sql in local mode, with all other settings left at their defaults, 
> on 3.1.2 to run the case below:
> {code:sql}
> // Some comments here
> set spark.sql.shuffle.partitions=20;
> set spark.sql.adaptive.enabled=true;
> -- set spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin=0; -- 
> (default 0.2) enable this to avoid demoting the broadcast hash join
> set spark.sql.autoBroadcastJoinThreshold=200;
> SELECT
>   l.id % 12345 k,
>   sum(l.id) sum,
>   count(l.id) cnt,
>   avg(l.id) avg,
>   min(l.id) min,
>   max(l.id) max
> from (select id % 3 id from range(0, 1e8, 1, 100)) l
>   left join (SELECT max(id) as id, id % 2 gid FROM range(0, 1000, 2, 100) 
> group by gid) r ON l.id = r.id
> GROUP BY 1;
> {code}
>  
>  1. Demote BHJ, with the nonEmptyPartitionRatioForBroadcastJoin line commented out:
>  
> ||Job Id||Submitted||Duration||Stages: Succeeded/Total||Tasks (for all stages): Succeeded/Total||
> |4|2021/08/06 17:31:37|71 ms|1/1 (4 skipped)|3/3 (205 skipped)|
> |3|2021/08/06 17:31:18|19 s|1/1 (3 skipped)|4/4 (201 skipped)|
> |2|2021/08/06 17:31:18|87 ms|1/1 (1 skipped)|1/1 (100 skipped)|
> |1|2021/08/06 17:31:16|2 s|1/1|100/100|
> |0|2021/08/06 17:31:15|2 s|1/1|100/100|
> (the Description column, identical for every job, is the query above; Spark UI links omitted)
> 2. Set nonEmptyPartitionRatioForBroadcastJoin to 0 to tell Spark not to 
> demote BHJ:
>  
> ||Job Id (Job Group)||Submitted||Duration||Stages: Succeeded/Total||Tasks (for all stages): Succeeded/Total||
> |5|2021/08/06 18:25:15|29 ms|1/1 (2 skipped)|3/3 (200 skipped)|
> |4|(remaining rows truncated in the original message)|

[jira] [Updated] (SPARK-35540) Make config maxShuffledHashJoinLocalMapThreshold fallback to advisoryPartitionSizeInBytes

2021-10-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35540:
--
Parent: SPARK-37063
Issue Type: Sub-task  (was: Bug)

> Make config maxShuffledHashJoinLocalMapThreshold fallback to 
> advisoryPartitionSizeInBytes
> -
>
> Key: SPARK-35540
> URL: https://issues.apache.org/jira/browse/SPARK-35540
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: XiDuo You
>Priority: Major
>
> Make the config `spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold` 
> fall back to `spark.sql.adaptive.advisoryPartitionSizeInBytes` so that we can 
> convert SMJ to SHJ in AQE by default.
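A minimal sketch of setting the two configs involved; the values are arbitrary 
examples, not recommendations, and this shows today's explicit-setting behaviour, 
not the proposed fallback:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("shj-config-sketch").getOrCreate()

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")
// Today this threshold has to be set on its own; the proposal is that, when unset,
// it falls back to advisoryPartitionSizeInBytes above.
spark.conf.set("spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold", "64MB")
{code}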



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36443) Demote BroadcastJoin causes performance regression and increases OOM risks

2021-10-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36443:
--
Parent: (was: SPARK-33828)
Issue Type: Bug  (was: Sub-task)

> Demote BroadcastJoin causes performance regression and increases OOM risks
> --
>
> Key: SPARK-36443
> URL: https://issues.apache.org/jira/browse/SPARK-36443
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Kent Yao
>Priority: Major
> Attachments: image-2021-08-06-11-24-34-122.png, 
> image-2021-08-06-17-57-15-765.png, screenshot-1.png
>
>
>  
> h2. A test case
> Use bin/spark-sql in local mode, with all other settings left at their defaults, 
> on 3.1.2 to run the case below:
> {code:sql}
> // Some comments here
> set spark.sql.shuffle.partitions=20;
> set spark.sql.adaptive.enabled=true;
> -- set spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin=0; -- 
> (default 0.2) enable this to avoid demoting the broadcast hash join
> set spark.sql.autoBroadcastJoinThreshold=200;
> SELECT
>   l.id % 12345 k,
>   sum(l.id) sum,
>   count(l.id) cnt,
>   avg(l.id) avg,
>   min(l.id) min,
>   max(l.id) max
> from (select id % 3 id from range(0, 1e8, 1, 100)) l
>   left join (SELECT max(id) as id, id % 2 gid FROM range(0, 1000, 2, 100) 
> group by gid) r ON l.id = r.id
> GROUP BY 1;
> {code}
>  
>  1. Demote BHJ, with the nonEmptyPartitionRatioForBroadcastJoin line commented out:
>  
> ||Job Id||Submitted||Duration||Stages: Succeeded/Total||Tasks (for all stages): Succeeded/Total||
> |4|2021/08/06 17:31:37|71 ms|1/1 (4 skipped)|3/3 (205 skipped)|
> |3|2021/08/06 17:31:18|19 s|1/1 (3 skipped)|4/4 (201 skipped)|
> |2|2021/08/06 17:31:18|87 ms|1/1 (1 skipped)|1/1 (100 skipped)|
> |1|2021/08/06 17:31:16|2 s|1/1|100/100|
> |0|2021/08/06 17:31:15|2 s|1/1|100/100|
> (the Description column, identical for every job, is the query above; Spark UI links omitted)
> 2. Set nonEmptyPartitionRatioForBroadcastJoin to 0 to tell Spark not to 
> demote BHJ:
>  
> ||Job Id (Job Group)||Submitted||Duration||Stages: Succeeded/Total||Tasks (for all stages): Succeeded/Total||
> |5|2021/08/06 18:25:15|29 ms|1/1 (2 skipped)|3/3 (200 skipped)|
> |4|(remaining rows truncated in the original message)|

[jira] [Updated] (SPARK-32304) Flaky Test: AdaptiveQueryExecSuite.multiple joins

2021-10-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32304:
--
Parent: SPARK-37063
Issue Type: Sub-task  (was: Bug)

> Flaky Test: AdaptiveQueryExecSuite.multiple joins
> -
>
> Key: SPARK-32304
> URL: https://issues.apache.org/jira/browse/SPARK-32304
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> AdaptiveQueryExecSuite:
> - Change merge join to broadcast join (313 milliseconds)
> - Reuse the parallelism of CoalescedShuffleReaderExec in 
> LocalShuffleReaderExec (265 milliseconds)
> - Reuse the default parallelism in LocalShuffleReaderExec (230 milliseconds)
> - Empty stage coalesced to 0-partition RDD (514 milliseconds)
> - Scalar subquery (406 milliseconds)
> - Scalar subquery in later stages (500 milliseconds)
> - multiple joins *** FAILED *** (739 milliseconds)
>   ArrayBuffer(BroadcastHashJoin [b#251429], [a#251438], Inner, BuildLeft
>   :- BroadcastQueryStage 5
>   :  +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[3, int, 
> false] as bigint))), [id=#504817]
>   : +- CustomShuffleReader local
>   :+- ShuffleQueryStage 4
>   :   +- Exchange hashpartitioning(b#251429, 5), true, [id=#504777]
>   :  +- *(7) SortMergeJoin [key#251418], [a#251428], Inner
>   : :- *(5) Sort [key#251418 ASC NULLS FIRST], false, 0
>   : :  +- CustomShuffleReader coalesced
>   : : +- ShuffleQueryStage 0
>   : :+- Exchange hashpartitioning(key#251418, 5), 
> true, [id=#504656]
>   : :   +- *(1) Filter (isnotnull(value#251419) AND 
> (cast(value#251419 as int) = 1))
>   : :  +- *(1) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#251418, 
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false) 
> AS value#251419]
>   : : +- Scan[obj#251417]
>   : +- *(6) Sort [a#251428 ASC NULLS FIRST], false, 0
>   :+- CustomShuffleReader coalesced
>   :   +- ShuffleQueryStage 1
>   :  +- Exchange hashpartitioning(a#251428, 5), true, 
> [id=#504663]
>   : +- *(2) Filter (b#251429 = 1)
>   :+- *(2) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#251428, 
> knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#251429]
>   :   +- Scan[obj#251427]
>   +- BroadcastHashJoin [n#251498], [a#251438], Inner, BuildRight
>  :- CustomShuffleReader local
>  :  +- ShuffleQueryStage 2
>  : +- Exchange hashpartitioning(n#251498, 5), true, [id=#504680]
>  :+- *(3) Filter (n#251498 = 1)
>  :   +- *(3) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).n AS n#251498, 
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).l, true, false) 
> AS l#251499]
>  :  +- Scan[obj#251497]
>  +- BroadcastQueryStage 6
> +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, 
> int, false] as bigint))), [id=#504830]
>+- CustomShuffleReader local
>   +- ShuffleQueryStage 3
>  +- Exchange hashpartitioning(a#251438, 5), true, [id=#504694]
> +- *(4) Filter (a#251438 = 1)
>+- *(4) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData3, true])).a AS a#251438, 
> unwrapoption(IntegerType, knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData3, true])).b) AS b#251439]
>   +- Scan[obj#251437]
>   , BroadcastHashJoin [n#251498], [a#251438], Inner, BuildRight
>   :- CustomShuffleReader local
>   :  +- ShuffleQueryStage 2
>   : +- Exchange hashpartitioning(n#251498, 5), true, [id=#504680]
>   :+- *(3) Filter (n#251498 = 1)
>   :   +- *(3) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).n AS n#251498, 
> 

[jira] [Updated] (SPARK-35540) Make config maxShuffledHashJoinLocalMapThreshold fallback to advisoryPartitionSizeInBytes

2021-10-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35540:
--
Parent: (was: SPARK-33828)
Issue Type: Bug  (was: Sub-task)

> Make config maxShuffledHashJoinLocalMapThreshold fallback to 
> advisoryPartitionSizeInBytes
> -
>
> Key: SPARK-35540
> URL: https://issues.apache.org/jira/browse/SPARK-35540
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: XiDuo You
>Priority: Major
>
> Make the config `spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold` 
> fall back to `spark.sql.adaptive.advisoryPartitionSizeInBytes` so that we can 
> convert SMJ to SHJ in AQE by default.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32304) Flaky Test: AdaptiveQueryExecSuite.multiple joins

2021-10-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32304:
--
Parent: (was: SPARK-33828)
Issue Type: Bug  (was: Sub-task)

> Flaky Test: AdaptiveQueryExecSuite.multiple joins
> -
>
> Key: SPARK-32304
> URL: https://issues.apache.org/jira/browse/SPARK-32304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> AdaptiveQueryExecSuite:
> - Change merge join to broadcast join (313 milliseconds)
> - Reuse the parallelism of CoalescedShuffleReaderExec in 
> LocalShuffleReaderExec (265 milliseconds)
> - Reuse the default parallelism in LocalShuffleReaderExec (230 milliseconds)
> - Empty stage coalesced to 0-partition RDD (514 milliseconds)
> - Scalar subquery (406 milliseconds)
> - Scalar subquery in later stages (500 milliseconds)
> - multiple joins *** FAILED *** (739 milliseconds)
>   ArrayBuffer(BroadcastHashJoin [b#251429], [a#251438], Inner, BuildLeft
>   :- BroadcastQueryStage 5
>   :  +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[3, int, 
> false] as bigint))), [id=#504817]
>   : +- CustomShuffleReader local
>   :+- ShuffleQueryStage 4
>   :   +- Exchange hashpartitioning(b#251429, 5), true, [id=#504777]
>   :  +- *(7) SortMergeJoin [key#251418], [a#251428], Inner
>   : :- *(5) Sort [key#251418 ASC NULLS FIRST], false, 0
>   : :  +- CustomShuffleReader coalesced
>   : : +- ShuffleQueryStage 0
>   : :+- Exchange hashpartitioning(key#251418, 5), 
> true, [id=#504656]
>   : :   +- *(1) Filter (isnotnull(value#251419) AND 
> (cast(value#251419 as int) = 1))
>   : :  +- *(1) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#251418, 
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false) 
> AS value#251419]
>   : : +- Scan[obj#251417]
>   : +- *(6) Sort [a#251428 ASC NULLS FIRST], false, 0
>   :+- CustomShuffleReader coalesced
>   :   +- ShuffleQueryStage 1
>   :  +- Exchange hashpartitioning(a#251428, 5), true, 
> [id=#504663]
>   : +- *(2) Filter (b#251429 = 1)
>   :+- *(2) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#251428, 
> knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#251429]
>   :   +- Scan[obj#251427]
>   +- BroadcastHashJoin [n#251498], [a#251438], Inner, BuildRight
>  :- CustomShuffleReader local
>  :  +- ShuffleQueryStage 2
>  : +- Exchange hashpartitioning(n#251498, 5), true, [id=#504680]
>  :+- *(3) Filter (n#251498 = 1)
>  :   +- *(3) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).n AS n#251498, 
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).l, true, false) 
> AS l#251499]
>  :  +- Scan[obj#251497]
>  +- BroadcastQueryStage 6
> +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, 
> int, false] as bigint))), [id=#504830]
>+- CustomShuffleReader local
>   +- ShuffleQueryStage 3
>  +- Exchange hashpartitioning(a#251438, 5), true, [id=#504694]
> +- *(4) Filter (a#251438 = 1)
>+- *(4) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData3, true])).a AS a#251438, 
> unwrapoption(IntegerType, knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData3, true])).b) AS b#251439]
>   +- Scan[obj#251437]
>   , BroadcastHashJoin [n#251498], [a#251438], Inner, BuildRight
>   :- CustomShuffleReader local
>   :  +- ShuffleQueryStage 2
>   : +- Exchange hashpartitioning(n#251498, 5), true, [id=#504680]
>   :+- *(3) Filter (n#251498 = 1)
>   :   +- *(3) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).n AS n#251498, 
> 

[jira] [Updated] (SPARK-35414) Completely fix the broadcast timeout issue in AQE

2021-10-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35414:
--
Parent: SPARK-37063
Issue Type: Sub-task  (was: Bug)

> Completely fix the broadcast timeout issue in AQE
> -
>
> Key: SPARK-35414
> URL: https://issues.apache.org/jira/browse/SPARK-35414
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Yu Zhong
>Assignee: Yu Zhong
>Priority: Major
>
> SPARK-33933 reports an issue where, in AQE, a broadcast timeout can happen when 
> resources are limited. 
> [#31269|https://github.com/apache/spark/pull/31269] gives a partial fix by 
> reordering newStages by class type to make sure BroadcastQueryStage precedes 
> the others when materialize() is called. However, it only guarantees the 
> scheduling order in normal circumstances; the guarantee is not strict because 
> the broadcast job and the shuffle map job are submitted from different 
> threads.
> So we need a complete fix that avoids the edge case triggering the broadcast 
> timeout.
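Separately from the proper fix discussed here, the timeout itself is controlled by a 
user-facing config; a hedged sketch of raising it as a stopgap (the value is an 
arbitrary example, and this does not address the root cause above):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("broadcast-timeout-sketch").getOrCreate()

// spark.sql.broadcastTimeout defaults to 300 seconds; 600 is just an example.
spark.conf.set("spark.sql.broadcastTimeout", "600")
{code}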



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35414) Completely fix the broadcast timeout issue in AQE

2021-10-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35414:
--
Parent: (was: SPARK-33828)
Issue Type: Bug  (was: Sub-task)

> Completely fix the broadcast timeout issue in AQE
> -
>
> Key: SPARK-35414
> URL: https://issues.apache.org/jira/browse/SPARK-35414
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Yu Zhong
>Assignee: Yu Zhong
>Priority: Major
>
> SPARK-33933 reports an issue where, in AQE, a broadcast timeout can happen when 
> resources are limited. 
> [#31269|https://github.com/apache/spark/pull/31269] gives a partial fix by 
> reordering newStages by class type to make sure BroadcastQueryStage precedes 
> the others when materialize() is called. However, it only guarantees the 
> scheduling order in normal circumstances; the guarantee is not strict because 
> the broadcast job and the shuffle map job are submitted from different 
> threads.
> So we need a complete fix that avoids the edge case triggering the broadcast 
> timeout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34827) Support fetching shuffle blocks in batch with i/o encryption

2021-10-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34827:
--
Parent: (was: SPARK-33828)
Issue Type: Bug  (was: Sub-task)

> Support fetching shuffle blocks in batch with i/o encryption
> 
>
> Key: SPARK-34827
> URL: https://issues.apache.org/jira/browse/SPARK-34827
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34827) Support fetching shuffle blocks in batch with i/o encryption

2021-10-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34827:
--
Parent: SPARK-37063
Issue Type: Sub-task  (was: Bug)

> Support fetching shuffle blocks in batch with i/o encryption
> 
>
> Key: SPARK-34827
> URL: https://issues.apache.org/jira/browse/SPARK-34827
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37063) SQL Adaptive Query Execution QA: Phase 2

2021-10-19 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-37063:
-

 Summary: SQL Adaptive Query Execution QA: Phase 2
 Key: SPARK-37063
 URL: https://issues.apache.org/jira/browse/SPARK-37063
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 3.3.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36913) Implement createIndex and IndexExists in JDBC (MySQL dialect)

2021-10-19 Thread Reynold Xin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430780#comment-17430780
 ] 

Reynold Xin commented on SPARK-36913:
-

My concern is not about JDBC (I should've commented on the parent ticket). My 
concern is that there are *a lot* of RDBMS features and we can't possibly 
support all of them. It seems like we'd be much better off just having a 
generic fallback API to execute a command that's passed through by Spark to the 
underlying data source, and then the underlying data source can decide what to 
do. Otherwise we will have to add create index, define foreign key, define 
sequence objects, and 50 other DDL commands in Spark.

 

 

> Implement createIndex and IndexExists in JDBC (MySQL dialect)
> -
>
> Key: SPARK-36913
> URL: https://issues.apache.org/jira/browse/SPARK-36913
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36913) Implement createIndex and IndexExists in JDBC (MySQL dialect)

2021-10-19 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430778#comment-17430778
 ] 

Huaxin Gao commented on SPARK-36913:


[~rxin] Hi Reynold, thank you for taking a look at this and sharing your 
concerns. Are you concerned about the SupportsIndex interface or the JDBC 
implementation? The interface I added is very generic, and it's up to the data 
source to implement it. The JDBC implementation just gives me a simple way 
to do end-to-end testing, so I don't have to implement the new interface in 
InMemoryTable for the test.

> Implement createIndex and IndexExists in JDBC (MySQL dialect)
> -
>
> Key: SPARK-36913
> URL: https://issues.apache.org/jira/browse/SPARK-36913
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29497) Cannot assign instance of java.lang.invoke.SerializedLambda to field

2021-10-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29497:
--
Affects Version/s: 3.2.0

> Cannot assign instance of java.lang.invoke.SerializedLambda to field
> 
>
> Key: SPARK-29497
> URL: https://issues.apache.org/jira/browse/SPARK-29497
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3, 3.0.1, 3.2.0
> Environment: Spark 2.4.3 Scala 2.12
>Reporter: Rob Russo
>Priority: Major
>
> Note this is for Scala 2.12:
> There seems to be an issue in Spark with serializing a UDF that is created 
> from a function assigned to a class member that references another function 
> assigned to a class member. This is similar to 
> https://issues.apache.org/jira/browse/SPARK-25047, but it looks like that 
> resolution does not cover this case. After trimming it down to the base 
> issue, I came up with the following reproduction:
>  
>  
> {code:java}
> object TestLambdaShell extends Serializable {
>   val hello: String => String = s => s"hello $s!"  
>   val lambdaTest: String => String = hello( _ )  
>   def functionTest: String => String = hello( _ )
> }
> val hello = udf( TestLambdaShell.hello )
> val functionTest = udf( TestLambdaShell.functionTest )
> val lambdaTest = udf( TestLambdaShell.lambdaTest )
> sc.parallelize(Seq("world"),1).toDF("test").select(hello($"test")).show(1)
> sc.parallelize(Seq("world"),1).toDF("test").select(functionTest($"test")).show(1)
> sc.parallelize(Seq("world"),1).toDF("test").select(lambdaTest($"test")).show(1)
> {code}
>  
> All of which works except the last line which results in an exception on the 
> executors:
>  
> {code:java}
> Caused by: java.lang.ClassCastException: cannot assign instance of 
> java.lang.invoke.SerializedLambda to field 
> $$$82b5b23cea489b2712a1db46c77e458w$TestLambdaShell$.lambdaTest of type 
> scala.Function1 in instance of 
> $$$82b5b23cea489b2712a1db46c77e458w$TestLambdaShell$
>   at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
>   at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2251)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1933)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1933)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529)
>   at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1933)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1933)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
>   at 
> 

[jira] [Updated] (SPARK-29497) Cannot assign instance of java.lang.invoke.SerializedLambda to field

2021-10-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29497:
--
Environment: 
Spark 2.4.3 Scala 2.12
Spark 3.2.0 Scala 2.13.5 (Java 11.0.12)

  was:Spark 2.4.3 Scala 2.12


> Cannot assign instance of java.lang.invoke.SerializedLambda to field
> 
>
> Key: SPARK-29497
> URL: https://issues.apache.org/jira/browse/SPARK-29497
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3, 3.0.1, 3.2.0
> Environment: Spark 2.4.3 Scala 2.12
> Spark 3.2.0 Scala 2.13.5 (Java 11.0.12)
>Reporter: Rob Russo
>Priority: Major
>
> Note this is for Scala 2.12:
> There seems to be an issue in Spark with serializing a UDF that is created 
> from a function assigned to a class member that references another function 
> assigned to a class member. This is similar to 
> https://issues.apache.org/jira/browse/SPARK-25047, but it looks like that 
> resolution does not cover this case. After trimming it down to the base 
> issue, I came up with the following reproduction:
>  
>  
> {code:java}
> object TestLambdaShell extends Serializable {
>   val hello: String => String = s => s"hello $s!"  
>   val lambdaTest: String => String = hello( _ )  
>   def functionTest: String => String = hello( _ )
> }
> val hello = udf( TestLambdaShell.hello )
> val functionTest = udf( TestLambdaShell.functionTest )
> val lambdaTest = udf( TestLambdaShell.lambdaTest )
> sc.parallelize(Seq("world"),1).toDF("test").select(hello($"test")).show(1)
> sc.parallelize(Seq("world"),1).toDF("test").select(functionTest($"test")).show(1)
> sc.parallelize(Seq("world"),1).toDF("test").select(lambdaTest($"test")).show(1)
> {code}
>  
> All of which works except the last line which results in an exception on the 
> executors:
>  
> {code:java}
> Caused by: java.lang.ClassCastException: cannot assign instance of 
> java.lang.invoke.SerializedLambda to field 
> $$$82b5b23cea489b2712a1db46c77e458w$TestLambdaShell$.lambdaTest of type 
> scala.Function1 in instance of 
> $$$82b5b23cea489b2712a1db46c77e458w$TestLambdaShell$
>   at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
>   at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2251)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1933)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1933)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529)
>   at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1933)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1933)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at 

[jira] [Commented] (SPARK-37062) Introduce a new data source for providing consistent set of rows per microbatch

2021-10-19 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430761#comment-17430761
 ] 

Jungtaek Lim commented on SPARK-37062:
--

Will submit a PR for this soon.

> Introduce a new data source for providing consistent set of rows per 
> microbatch
> ---
>
> Key: SPARK-37062
> URL: https://issues.apache.org/jira/browse/SPARK-37062
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> The "rate" data source has been known to be used as a benchmark for streaming 
> query.
> While this helps to put the query to the limit (how many rows the query could 
> process per second), the rate data source doesn't provide consistent rows per 
> batch into stream, which leads two environments be hard to compare with.
> For example, in many cases, you may want to compare the metrics in the 
> batches between test environments (like running same streaming query with 
> different options). These metrics are strongly affected if the distribution 
> of input rows in batches are changing, especially a micro-batch has been 
> lagged (in any reason) and rate data source produces more input rows to the 
> next batch.
> Also, when you test against streaming aggregation, you may want the data 
> source produces the same set of input rows per batch (deterministic), so that 
> you can plan how these input rows will be aggregated and how state rows will 
> be evicted, and craft the test query based on the plan.
> The requirements of new data source would follow:
> * it should produce a specific number of input rows as requested
> * it should also include a timestamp (event time) into each row
> ** to make the input rows fully deterministic, timestamp should be configured 
> as well (like start timestamp & amount of advance per batch)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37062) Introduce a new data source for providing consistent set of rows per microbatch

2021-10-19 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-37062:


 Summary: Introduce a new data source for providing consistent set 
of rows per microbatch
 Key: SPARK-37062
 URL: https://issues.apache.org/jira/browse/SPARK-37062
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 3.3.0
Reporter: Jungtaek Lim


The "rate" data source has been known to be used as a benchmark for streaming 
query.

While this helps to put the query to the limit (how many rows the query could 
process per second), the rate data source doesn't provide consistent rows per 
batch into stream, which leads two environments be hard to compare with.

For example, in many cases, you may want to compare the metrics in the batches 
between test environments (like running same streaming query with different 
options). These metrics are strongly affected if the distribution of input rows 
in batches are changing, especially a micro-batch has been lagged (in any 
reason) and rate data source produces more input rows to the next batch.

Also, when you test against streaming aggregation, you may want the data source 
produces the same set of input rows per batch (deterministic), so that you can 
plan how these input rows will be aggregated and how state rows will be 
evicted, and craft the test query based on the plan.

The requirements of new data source would follow:

* it should produce a specific number of input rows as requested
* it should also include a timestamp (event time) into each row
** to make the input rows fully deterministic, timestamp should be configured 
as well (like start timestamp & amount of advance per batch)
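For contrast, here is a minimal sketch using the existing "rate" source (which does 
exist today) to show the per-batch nondeterminism described above; nothing below is 
the proposed new source:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("rate-source-demo").getOrCreate()

// The "rate" source derives rows from wall-clock time, so how many rows land in
// each micro-batch depends on trigger timing and any scheduling lag.
val rate = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load() // schema: timestamp TIMESTAMP, value LONG

val query = rate.writeStream
  .format("console")
  .option("truncate", "false")
  .start()

// The per-batch row counts printed here drift between runs and environments,
// which is exactly the nondeterminism the proposed source is meant to remove.
query.awaitTermination(15000)
query.stop()
{code}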



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37061) Custom V2 Metrics uses wrong classname for lookup

2021-10-19 Thread Russell Spitzer (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Russell Spitzer updated SPARK-37061:

Description: 
Currently CustomMetrics uses `getCanonicalName` to get the metric type name

https://github.com/apache/spark/blob/38493401d18d42a6cb176bf515536af97ba1338b/sql/core/src/main/scala/org/apache/spark/sql/execution/metric/CustomMetrics.scala#L31-L33

But when looking the class up via reflection we need the original (binary) type 
name.

Here is an example of the difference for an inner class:

{code:java title="CanonicalName vs Name"}
Class.getName = 
org.apache.iceberg.spark.source.SparkBatchScan$FilesScannedMetric
Class.getCanonicalName =
org.apache.iceberg.spark.source.SparkBatchScan.FilesScannedMetric
{code}


The "$" name (from getName) is required to look up this class, while the "." 
version (from getCanonicalName) fails with a ClassNotFoundException.

  was:
Currently CustomMetrics uses `getCanonicalName` to get the metric type name

https://github.com/apache/spark/blob/38493401d18d42a6cb176bf515536af97ba1338b/sql/core/src/main/scala/org/apache/spark/sql/execution/metric/CustomMetrics.scala#L31-L33

But when using reflection we need to use the original type name.

Here is an example when working with an inner class

```
Class.getName = 
org.apache.iceberg.spark.source.SparkBatchScan$FilesScannedMetric
Class.getCanonicalName =
org.apache.iceberg.spark.source.SparkBatchScan.FilesScannedMetric
```

The "$" name is required to look up this class while the "." version will fail 
with CNF.


> Custom V2 Metrics uses wrong classname for lookup
> -
>
> Key: SPARK-37061
> URL: https://issues.apache.org/jira/browse/SPARK-37061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Russell Spitzer
>Priority: Minor
>
> Currently CustomMetrics uses `getCanonicalName` to get the metric type name
> https://github.com/apache/spark/blob/38493401d18d42a6cb176bf515536af97ba1338b/sql/core/src/main/scala/org/apache/spark/sql/execution/metric/CustomMetrics.scala#L31-L33
> But when looking the class up via reflection we need the original (binary) type name.
> Here is an example of the difference for an inner class:
> {code:java title="CanonicalName vs Name"}
> Class.getName = 
> org.apache.iceberg.spark.source.SparkBatchScan$FilesScannedMetric
> Class.getCanonicalName =
> org.apache.iceberg.spark.source.SparkBatchScan.FilesScannedMetric
> {code}
> The "$" name (from getName) is required to look up this class, while the "." 
> version (from getCanonicalName) fails with a ClassNotFoundException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37061) Custom V2 Metrics uses wrong classname in for lookup

2021-10-19 Thread Russell Spitzer (Jira)
Russell Spitzer created SPARK-37061:
---

 Summary: Custom V2 Metrics uses wrong classname in for lookup
 Key: SPARK-37061
 URL: https://issues.apache.org/jira/browse/SPARK-37061
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Russell Spitzer


Currently CustomMetrics uses `getCanonicalName` to get the metric type name

https://github.com/apache/spark/blob/38493401d18d42a6cb176bf515536af97ba1338b/sql/core/src/main/scala/org/apache/spark/sql/execution/metric/CustomMetrics.scala#L31-L33

But when using reflection we need to use the original type name.

Here is an example when working with an inner class

```
Class.getName = 
org.apache.iceberg.spark.source.SparkBatchScan$FilesScannedMetric
Class.getCanonicalName =
org.apache.iceberg.spark.source.SparkBatchScan.FilesScannedMetric
```

The "$" name is required to look up this class while the "." version will fail 
with CNF.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37061) Custom V2 Metrics uses wrong classname for lookup

2021-10-19 Thread Russell Spitzer (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Russell Spitzer updated SPARK-37061:

Summary: Custom V2 Metrics uses wrong classname for lookup  (was: Custom V2 
Metrics uses wrong classname in for lookup)

> Custom V2 Metrics uses wrong classname for lookup
> -
>
> Key: SPARK-37061
> URL: https://issues.apache.org/jira/browse/SPARK-37061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Russell Spitzer
>Priority: Minor
>
> Currently CustomMetrics uses `getCanonicalName` to get the metric type name
> https://github.com/apache/spark/blob/38493401d18d42a6cb176bf515536af97ba1338b/sql/core/src/main/scala/org/apache/spark/sql/execution/metric/CustomMetrics.scala#L31-L33
> But when using reflection we need to use the original type name.
> Here is an example when working with an inner class
> ```
> Class.getName = 
> org.apache.iceberg.spark.source.SparkBatchScan$FilesScannedMetric
> Class.getCanonicalName =
> org.apache.iceberg.spark.source.SparkBatchScan.FilesScannedMetric
> ```
> The "$" name is required to look up this class while the "." version will 
> fail with CNF.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37044) Add Row to __all__ in pyspark.sql.types

2021-10-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430675#comment-17430675
 ] 

Apache Spark commented on SPARK-37044:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/34332

> Add Row to __all__ in pyspark.sql.types
> ---
>
> Key: SPARK-37044
> URL: https://issues.apache.org/jira/browse/SPARK-37044
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.0, 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> Currently {{Row}}, defined in {{pyspark.sql.types}}, is exported from 
> {{pyspark.sql}} but not from {{types}} itself. This means that {{from 
> pyspark.sql.types import *}} won't import {{Row}}.
> This can be counter-intuitive, especially since we import {{Row}} from 
> {{types}} in the {{examples}}.
> Should we add it to {{__all__}}?
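
A small sketch of the behaviour being discussed (the {{__all__}} check reflects 
the current state this ticket proposes to change):

{code:python}
# Sketch of the current behaviour described above: star imports honour __all__,
# so Row is not picked up, while an explicit import works.
import pyspark.sql.types as T

print("Row" in T.__all__)          # False today; the proposal is to make this True

from pyspark.sql.types import *    # does not bind Row, because it is absent from __all__
from pyspark.sql.types import Row  # explicit import works regardless
print(Row(name="Alice", age=1))
{code}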



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37044) Add Row to __all__ in pyspark.sql.types

2021-10-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37044:


Assignee: (was: Apache Spark)

> Add Row to __all__ in pyspark.sql.types
> ---
>
> Key: SPARK-37044
> URL: https://issues.apache.org/jira/browse/SPARK-37044
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.0, 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> Currently {{Row}}, defined in {{pyspark.sql.types}}, is exported from 
> {{pyspark.sql}} but not from {{types}} itself. This means that {{from 
> pyspark.sql.types import *}} won't import {{Row}}.
> This can be counter-intuitive, especially since we import {{Row}} from 
> {{types}} in the {{examples}}.
> Should we add it to {{__all__}}?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37044) Add Row to __all__ in pyspark.sql.types

2021-10-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430673#comment-17430673
 ] 

Apache Spark commented on SPARK-37044:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/34332

> Add Row to __all__ in pyspark.sql.types
> ---
>
> Key: SPARK-37044
> URL: https://issues.apache.org/jira/browse/SPARK-37044
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.0, 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> Currently {{Row}}, defined in {{pyspark.sql.types}}, is exported from 
> {{pyspark.sql}} but not from {{types}} itself. This means that {{from 
> pyspark.sql.types import *}} won't import {{Row}}.
> This can be counter-intuitive, especially since we import {{Row}} from 
> {{types}} in the {{examples}}.
> Should we add it to {{__all__}}?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37044) Add Row to __all__ in pyspark.sql.types

2021-10-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37044:


Assignee: Apache Spark

> Add Row to __all__ in pyspark.sql.types
> ---
>
> Key: SPARK-37044
> URL: https://issues.apache.org/jira/browse/SPARK-37044
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.0, 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Minor
>
> Currently {{Row}}, defined in {{pyspark.sql.types}}, is exported from 
> {{pyspark.sql}} but not from {{types}} itself. This means that {{from 
> pyspark.sql.types import *}} won't import {{Row}}.
> This can be counter-intuitive, especially since we import {{Row}} from 
> {{types}} in the {{examples}}.
> Should we add it to {{__all__}}?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours

2021-10-19 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-33152:
-
Description: 
h2. Q1. What are you trying to do? Articulate your objectives using absolutely 
no jargon.

Proposing a new algorithm to create, store and use constraints for removing 
redundant filters & inferring new filters.
The current algorithm has subpar performance in complex expression scenarios 
involving aliases (with certain use cases the compilation time can go into 
hours), has the potential to cause OOM, may miss removing redundant filters in 
different scenarios, may miss creating IsNotNull constraints in different 
scenarios, and does not push compound predicates in Join.

# If not fixed, this issue can cause OutOfMemory errors or unacceptable query 
compilation times.
A test "plan equivalence with case statements and performance comparison with 
benefit of more than 10x conservatively" has been added in 
org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. *With 
this PR the compilation time is 247 ms vs 13958 ms without the change.*
# It is more effective in filter pruning, as is evident in some of the tests in 
org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite where 
the current code is not able to identify the redundant filter in some cases.
# It is able to generate a better optimized plan for join queries as it can 
push compound predicates.
# The current logic can miss a lot of possible cases of removing redundant 
predicates, as it fails to take into account whether the same attribute or its 
aliases are repeated multiple times in a complex expression.
# There are cases where some of the optimizer rules involving removal of 
redundant predicates fail to remove them on the basis of constraint data. In 
some cases the rule works just by virtue of previous rules helping it out to 
cover the inaccuracy. That the ConstraintPropagation rule & its function of 
removing redundant filters & adding new inferred filters depends on the 
behaviour of other, unrelated, preceding optimizer rules is indicative of 
issues.
# It does away with all the EqualNullSafe constraints, as this logic does not 
need those constraints to be created.
# There is at least one test in the existing ConstraintPropagationSuite which 
is missing an IsNotNull constraint because the code incorrectly generated an 
EqualNullSafe constraint instead of an EqualTo constraint when using the 
existing Constraints code. With these changes, the test correctly creates an 
EqualTo constraint, resulting in an inferred IsNotNull constraint.
# It does away with the current combinatorial logic of evaluating all the 
constraints, which can cause compilation to run for hours or cause OOM. The 
number of constraints stored is exactly the same as the number of filters 
encountered.

h2. Q2. What problem is this proposal NOT designed to solve?
It mainly focuses on compile-time performance, but in some cases it can benefit 
run-time characteristics too, like inferring an IsNotNull filter or pushing 
down compound predicates to the join, which the present code may miss or does 
not do, respectively.

h2. Q3. How is it done today, and what are the limits of current practice?
The current ConstraintsPropagation code pessimistically tries to generate all 
the possible combinations of constraints based on the aliases (even then it may 
miss a lot of combinations if the expression is a complex expression involving 
the same attribute repeated multiple times and there are many aliases to that 
column). There are query plans in our production environment which can result 
in the intermediate number of constraints going into hundreds of thousands, 
causing OOM or compilation times running into hours.
Also, there are cases where it incorrectly generates an EqualNullSafe 
constraint instead of an EqualTo constraint, thus missing a possible IsNotNull 
constraint on the column.
Also, it only pushes single-column predicates to the other side of the join.
The constraints generated are, in some cases, missing the required ones, and 
the plan apparently behaves correctly only due to preceding, unrelated 
optimizer rules. There is a test which shows that, with the bare minimum rules 
containing RemoveRedundantPredicate, the removal of the redundant predicate is 
missed.

h2. Q4. What is new in your approach and why do you think it will be successful?
It solves all the above-mentioned issues.
# The number of constraints created is the same as the number of filters. No 
combinatorial creation of constraints. No need for EqualNullSafe constraints on 
aliases.
# It can remove redundant predicates on any expression involving aliases, 
irrespective of the number of repeated occurrences, in all possible 
combinations.
# It brings query compilation time down from hours to a few minutes.
# It can push compound predicates on Joins & infer the right number of 
IsNotNull constraints.
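
Below is an illustrative PySpark sketch (not taken from the SPIP; the table and 
column names are made up) of the kind of aliased, redundant predicate the 
proposal targets:

{code:python}
# Illustrative sketch only: a redundant filter expressed through an alias.
# An effective constraint-propagation rule should drop the outer filter,
# since "id_alias > 10" is implied by "id > 10" through the alias.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("constraint-propagation-sketch").getOrCreate()

spark.range(0, 1000).createOrReplaceTempView("t")

plan = spark.sql("""
    SELECT *
    FROM (SELECT id, id AS id_alias FROM t WHERE id > 10) sub
    WHERE id_alias > 10
""")

# Inspect the optimized plan to see whether the outer predicate survives.
plan.explain(extended=True)
{code}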

[jira] [Updated] (SPARK-37060) Report driver status does not handle response from backup masters

2021-10-19 Thread Mohamadreza Rostami (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohamadreza Rostami updated SPARK-37060:

Description: After an improvement in SPARK-31486, the contributor uses the 
'asyncSendToMasterAndForwardReply' method instead of 
'activeMasterEndpoint.askSync' to get the status of the driver. Since the 
driver's status is only available on the active master and the 
'asyncSendToMasterAndForwardReply' method iterates over all of the masters, we 
have to handle the responses from the backup masters in the client, which the 
developer did not consider in the SPARK-31486 change. So drivers running in 
cluster mode on a cluster with multiple masters are affected by this bug. I 
created a patch for this bug and will soon send the pull request.  (was: After a 
improvement in SPARK-31486, contributor uses 'asyncSendToMasterAndForwardReply' 
method instead of 'activeMasterEndpoint.askSync' to get the status of driver. 
Since the driver's status is only available inactive master and the 
'asyncSendToMasterAndForwardReply' method iterate over all of the masters, we 
have to handle the response from the backup masters in the client, which the 
developer did not consider in the SPARK-31486 change. So drivers running in 
cluster mode and on a cluster with multi-master affected by this bug. I created 
the patch for this bug and will soon be sent the pull request.)

> Report driver status does not handle response from backup masters
> -
>
> Key: SPARK-37060
> URL: https://issues.apache.org/jira/browse/SPARK-37060
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0
>Reporter: Mohamadreza Rostami
>Priority: Major
>
> After an improvement in SPARK-31486, the contributor uses the 
> 'asyncSendToMasterAndForwardReply' method instead of 
> 'activeMasterEndpoint.askSync' to get the status of the driver. Since the 
> driver's status is only available on the active master and the 
> 'asyncSendToMasterAndForwardReply' method iterates over all of the masters, we 
> have to handle the responses from the backup masters in the client, which the 
> developer did not consider in the SPARK-31486 change. So drivers running in 
> cluster mode on a cluster with multiple masters are affected by this bug. I 
> created a patch for this bug and will soon send the pull request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours

2021-10-19 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-33152:
-
Description: 
h2. Q1. What are you trying to do? Articulate your objectives using absolutely 
no jargon.

Proposing a new algorithm to create, store and use constraints for removing 
redundant filters & inferring new filters.
The current algorithm has subpar performance in complex expression scenarios 
involving aliases (with certain use cases the compilation time can go into 
hours), has the potential to cause OOM, may miss removing redundant filters in 
different scenarios, may miss creating IsNotNull constraints in different 
scenarios, and does not push compound predicates in Join.

# If not fixed, this issue can cause OutOfMemory errors or unacceptable query 
compilation times.
A test "plan equivalence with case statements and performance comparison with 
benefit of more than 10x conservatively" has been added in 
org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. With 
this PR the compilation time is 247 ms vs 13958 ms without the change.
# It is more effective in filter pruning, as is evident in some of the tests in 
org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite where 
the current code is not able to identify the redundant filter in some cases.
# It is able to generate a better optimized plan for join queries as it can 
push compound predicates.
# The current logic can miss a lot of possible cases of removing redundant 
predicates, as it fails to take into account whether the same attribute or its 
aliases are repeated multiple times in a complex expression.
# There are cases where some of the optimizer rules involving removal of 
redundant predicates fail to remove them on the basis of constraint data. In 
some cases the rule works just by virtue of previous rules helping it out to 
cover the inaccuracy. That the ConstraintPropagation rule & its function of 
removing redundant filters & adding new inferred filters depends on the 
behaviour of other, unrelated, preceding optimizer rules is indicative of 
issues.
# It does away with all the EqualNullSafe constraints, as this logic does not 
need those constraints to be created.
# There is at least one test in the existing ConstraintPropagationSuite which 
is missing an IsNotNull constraint because the code incorrectly generated an 
EqualNullSafe constraint instead of an EqualTo constraint when using the 
existing Constraints code. With these changes, the test correctly creates an 
EqualTo constraint, resulting in an inferred IsNotNull constraint.
# It does away with the current combinatorial logic of evaluating all the 
constraints, which can cause compilation to run for hours or cause OOM. The 
number of constraints stored is exactly the same as the number of filters 
encountered.

h2. Q2. What problem is this proposal NOT designed to solve?
It mainly focuses on compile-time performance, but in some cases it can benefit 
run-time characteristics too, like inferring an IsNotNull filter or pushing 
down compound predicates to the join, which the present code may miss or does 
not do, respectively.

h2. Q3. How is it done today, and what are the limits of current practice?
The current ConstraintsPropagation code pessimistically tries to generate all 
the possible combinations of constraints based on the aliases (even then it may 
miss a lot of combinations if the expression is a complex expression involving 
the same attribute repeated multiple times and there are many aliases to that 
column). There are query plans in our production environment which can result 
in the intermediate number of constraints going into hundreds of thousands, 
causing OOM or compilation times running into hours.
Also, there are cases where it incorrectly generates an EqualNullSafe 
constraint instead of an EqualTo constraint, thus missing a possible IsNotNull 
constraint on the column.
Also, it only pushes single-column predicates to the other side of the join.
The constraints generated are, in some cases, missing the required ones, and 
the plan apparently behaves correctly only due to preceding, unrelated 
optimizer rules. There is a test which shows that, with the bare minimum rules 
containing RemoveRedundantPredicate, the removal of the redundant predicate is 
missed.

h2. Q4. What is new in your approach and why do you think it will be successful?
It solves all the above-mentioned issues.
# The number of constraints created is the same as the number of filters. No 
combinatorial creation of constraints. No need for EqualNullSafe constraints on 
aliases.
# It can remove redundant predicates on any expression involving aliases, 
irrespective of the number of repeated occurrences, in all possible 
combinations.
# It brings query compilation time down from hours to a few minutes.
# It can push compound predicates on Joins & infer the right number of 
IsNotNull constraints.

[jira] [Assigned] (SPARK-37060) Report driver status does not handle response from backup masters

2021-10-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37060:


Assignee: (was: Apache Spark)

> Report driver status does not handle response from backup masters
> -
>
> Key: SPARK-37060
> URL: https://issues.apache.org/jira/browse/SPARK-37060
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0
>Reporter: Mohamadreza Rostami
>Priority: Major
>
> After a improvement in SPARK-31486, contributor uses 
> 'asyncSendToMasterAndForwardReply' method instead of 
> 'activeMasterEndpoint.askSync' to get the status of driver. Since the 
> driver's status is only available inactive master and the 
> 'asyncSendToMasterAndForwardReply' method iterate over all of the masters, we 
> have to handle the response from the backup masters in the client, which the 
> developer did not consider in the SPARK-31486 change. So drivers running in 
> cluster mode and on a cluster with multi-master affected by this bug. I 
> created the patch for this bug and will soon be sent the pull request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37060) Report driver status does not handle response from backup masters

2021-10-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430613#comment-17430613
 ] 

Apache Spark commented on SPARK-37060:
--

User 'mohamadrezarostami' has created a pull request for this issue:
https://github.com/apache/spark/pull/34331

> Report driver status does not handle response from backup masters
> -
>
> Key: SPARK-37060
> URL: https://issues.apache.org/jira/browse/SPARK-37060
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0
>Reporter: Mohamadreza Rostami
>Priority: Major
>
> After a improvement in SPARK-31486, contributor uses 
> 'asyncSendToMasterAndForwardReply' method instead of 
> 'activeMasterEndpoint.askSync' to get the status of driver. Since the 
> driver's status is only available inactive master and the 
> 'asyncSendToMasterAndForwardReply' method iterate over all of the masters, we 
> have to handle the response from the backup masters in the client, which the 
> developer did not consider in the SPARK-31486 change. So drivers running in 
> cluster mode and on a cluster with multi-master affected by this bug. I 
> created the patch for this bug and will soon be sent the pull request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37060) Report driver status does not handle response from backup masters

2021-10-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37060:


Assignee: Apache Spark

> Report driver status does not handle response from backup masters
> -
>
> Key: SPARK-37060
> URL: https://issues.apache.org/jira/browse/SPARK-37060
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0
>Reporter: Mohamadreza Rostami
>Assignee: Apache Spark
>Priority: Major
>
> After a improvement in SPARK-31486, contributor uses 
> 'asyncSendToMasterAndForwardReply' method instead of 
> 'activeMasterEndpoint.askSync' to get the status of driver. Since the 
> driver's status is only available inactive master and the 
> 'asyncSendToMasterAndForwardReply' method iterate over all of the masters, we 
> have to handle the response from the backup masters in the client, which the 
> developer did not consider in the SPARK-31486 change. So drivers running in 
> cluster mode and on a cluster with multi-master affected by this bug. I 
> created the patch for this bug and will soon be sent the pull request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37060) Report driver status does not handle response from backup masters

2021-10-19 Thread Mohamadreza Rostami (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohamadreza Rostami updated SPARK-37060:

Description: After a improvement in SPARK-31486, contributor uses 
'asyncSendToMasterAndForwardReply' method instead of 
'activeMasterEndpoint.askSync' to get the status of driver. Since the driver's 
status is only available inactive master and the 
'asyncSendToMasterAndForwardReply' method iterate over all of the masters, we 
have to handle the response from the backup masters in the client, which the 
developer did not consider in the SPARK-31486 change. So drivers running in 
cluster mode and on a cluster with multi-master affected by this bug. I created 
the patch for this bug and will soon be sent the pull request.  (was: After a 
improvement in SPARK-31486, contributor uses 'asyncSendToMasterAndForwardReply' 
method instead of 'activeMasterEndpoint.askSync' to get the status of driver. 
Since the driver's status is only available inactive master and the 
'asyncSendToMasterAndForwardReply' method iterate over all of the masters, we 
have to handle the response from the backup masters in the client, which the 
developer did not consider in the SPARK-31486 change. So drivers running in 
cluster mode and on a cluster with multi-master affected by this bug.)

> Report driver status does not handle response from backup masters
> -
>
> Key: SPARK-37060
> URL: https://issues.apache.org/jira/browse/SPARK-37060
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0
>Reporter: Mohamadreza Rostami
>Priority: Major
>
> After a improvement in SPARK-31486, contributor uses 
> 'asyncSendToMasterAndForwardReply' method instead of 
> 'activeMasterEndpoint.askSync' to get the status of driver. Since the 
> driver's status is only available inactive master and the 
> 'asyncSendToMasterAndForwardReply' method iterate over all of the masters, we 
> have to handle the response from the backup masters in the client, which the 
> developer did not consider in the SPARK-31486 change. So drivers running in 
> cluster mode and on a cluster with multi-master affected by this bug. I 
> created the patch for this bug and will soon be sent the pull request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37060) Report driver status does not handle response from backup masters

2021-10-19 Thread Mohamadreza Rostami (Jira)
Mohamadreza Rostami created SPARK-37060:
---

 Summary: Report driver status does not handle response from backup 
masters
 Key: SPARK-37060
 URL: https://issues.apache.org/jira/browse/SPARK-37060
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.0, 3.1.2, 3.1.1, 3.1.0
Reporter: Mohamadreza Rostami


After a improvement in SPARK-31486, contributor uses 
'asyncSendToMasterAndForwardReply' method instead of 
'activeMasterEndpoint.askSync' to get the status of driver. Since the driver's 
status is only available inactive master and the 
'asyncSendToMasterAndForwardReply' method iterate over all of the masters, we 
have to handle the response from the backup masters in the client, which the 
developer did not consider in the SPARK-31486 change. So drivers running in 
cluster mode and on a cluster with multi-master affected by this bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36796) Make all unit tests pass on Java 17

2021-10-19 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-36796.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34153
[https://github.com/apache/spark/pull/34153]

> Make all unit tests pass on Java 17
> ---
>
> Key: SPARK-36796
> URL: https://issues.apache.org/jira/browse/SPARK-36796
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36796) Make all unit tests pass on Java 17

2021-10-19 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-36796:


Assignee: Yang Jie

> Make all unit tests pass on Java 17
> ---
>
> Key: SPARK-36796
> URL: https://issues.apache.org/jira/browse/SPARK-36796
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Assignee: Yang Jie
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37059) Ensure the sort order of the output in the PySpark doctests

2021-10-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37059:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Ensure the sort order of the output in the PySpark doctests
> ---
>
> Key: SPARK-37059
> URL: https://issues.apache.org/jira/browse/SPARK-37059
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> The collect_set builtin function doesn't ensure the sort order of its result 
> for each row. FPGrowthModel.freqItemsets also doesn't ensure the sort order 
> of the result rows.
> Nevertheless, their PySpark doctests assume a certain sort order, causing 
> such doctests to fail with Scala 2.13.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37059) Ensure the sort order of the output in the PySpark doctests

2021-10-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430551#comment-17430551
 ] 

Apache Spark commented on SPARK-37059:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34330

> Ensure the sort order of the output in the PySpark doctests
> ---
>
> Key: SPARK-37059
> URL: https://issues.apache.org/jira/browse/SPARK-37059
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> The collect_set builtin function doesn't ensure the sort order of its result 
> for each row. FPGrowthModel.freqItemsets also doesn't ensure the sort order 
> of the result rows.
> Nevertheless, their PySpark doctests assume a certain sort order, causing 
> such doctests to fail with Scala 2.13.
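
For context, a minimal PySpark sketch of the kind of fix applied to such 
doctests: sort the collected set inside the query so the printed output no 
longer depends on collect_set's unspecified ordering.

{code:python}
# Sketch of one way to make such doctest output deterministic: sort the
# collected set inside the query instead of relying on collect_set's
# (unspecified) ordering, which can differ between Scala 2.12 and 2.13 builds.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("deterministic-doctest-sketch").getOrCreate()

df = spark.createDataFrame([(2,), (5,), (5,), (4,)], ["age"])

# sort_array gives a stable, build-independent result: [2, 4, 5]
df.agg(F.sort_array(F.collect_set("age")).alias("ages")).show()
{code}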



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37059) Ensure the sort order of the output in the PySpark doctests

2021-10-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37059:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Ensure the sort order of the output in the PySpark doctests
> ---
>
> Key: SPARK-37059
> URL: https://issues.apache.org/jira/browse/SPARK-37059
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> The collect_set builtin function doesn't ensure the sort order of its result 
> for each row. FPGrowthModel.freqItemsets also doesn't ensure the sort order 
> of the result rows.
> Nevertheless, their PySpark doctests assume a certain sort order, causing 
> such doctests to fail with Scala 2.13.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37059) Ensure the sort order of the output in the PySpark doctests

2021-10-19 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-37059:
---
Description: 
The collect_set builtin function doesn't ensure the sort order of its result 
for each row. FPGrowthModel.freqItemsets also doesn't ensure the sort order of 
the result rows.
Nevertheless, their PySpark doctests assume a certain sort order, causing such 
doctests to fail with Scala 2.13.


  was:
The collect_set builtin function doesn't ensure the sort order of its result 
for each row. FPGrouthModel.freqItemsets also doesn't ensure the sort order of 
the result rows.
Nevertheless, their doctests for PySpark assume a certain kind of sort order, 
causing that such doctests fail with Scala 2.13.



> Ensure the sort order of the output in the PySpark doctests
> ---
>
> Key: SPARK-37059
> URL: https://issues.apache.org/jira/browse/SPARK-37059
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> The collect_set builtin function doesn't ensure the sort order of its result 
> for each row. FPGrowthModel.freqItemsets also doesn't ensure the sort order 
> of the result rows.
> Nevertheless, their PySpark doctests assume a certain sort order, causing 
> such doctests to fail with Scala 2.13.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37059) Ensure the sort order of the output in the PySpark doctests

2021-10-19 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-37059:
---
Description: 
The collect_set builtin function doesn't ensure the sort order of its result 
for each row. FPGrouthModel.freqItemsets also doesn't ensure the sort order of 
the result rows.
Nevertheless, thier doctests for PySpark assume a certain kind of sort order, 
causing that such doctests fail with Scala 2.13.


  was:
The collect_set builtin function doesn't ensure the sort order of its result 
for each row. FPGrouthModel.freqItemsets also doesn' ensure the sort order of 
the result rows.
Nevertheless, thier doctests for PySpark assume a certain kind of sort order, 
causing that such doctests fail with Scala 2.13.



> Ensure the sort order of the output in the PySpark doctests
> ---
>
> Key: SPARK-37059
> URL: https://issues.apache.org/jira/browse/SPARK-37059
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> The collect_set builtin function doesn't ensure the sort order of its result 
> for each row. FPGrouthModel.freqItemsets also doesn't ensure the sort order 
> of the result rows.
> Nevertheless, thier doctests for PySpark assume a certain kind of sort order, 
> causing that such doctests fail with Scala 2.13.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37059) Ensure the sort order of the output in the PySpark doctests

2021-10-19 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-37059:
---
Description: 
The collect_set builtin function doesn't ensure the sort order of its result 
for each row. FPGrouthModel.freqItemsets also doesn't ensure the sort order of 
the result rows.
Nevertheless, their doctests for PySpark assume a certain kind of sort order, 
causing that such doctests fail with Scala 2.13.


  was:
The collect_set builtin function doesn't ensure the sort order of its result 
for each row. FPGrouthModel.freqItemsets also doesn't ensure the sort order of 
the result rows.
Nevertheless, thier doctests for PySpark assume a certain kind of sort order, 
causing that such doctests fail with Scala 2.13.



> Ensure the sort order of the output in the PySpark doctests
> ---
>
> Key: SPARK-37059
> URL: https://issues.apache.org/jira/browse/SPARK-37059
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> The collect_set builtin function doesn't ensure the sort order of its result 
> for each row. FPGrouthModel.freqItemsets also doesn't ensure the sort order 
> of the result rows.
> Nevertheless, their doctests for PySpark assume a certain kind of sort order, 
> causing that such doctests fail with Scala 2.13.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37059) Ensure the sort order of the output in the PySpark doctests

2021-10-19 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-37059:
---
Description: 
The collect_set builtin function doesn't ensure the sort order of its result 
for each row. FPGrouthModel.freqItemsets also doesn' ensure the sort order of 
the result rows.
Nevertheless, thier doctests for PySpark assume a certain kind of sort order, 
causing that such doctests fail with Scala 2.13.


  was:
The collect_set builtin function doesn't ensure the sort order of its result. 
FPGrouthModel.freqItemsets also doesn' ensure the sort order of the result rows.
Nevertheless, thier doctests for PySpark assume a certain kind of sort order, 
causing that such doctests fail with Scala 2.13.



> Ensure the sort order of the output in the PySpark doctests
> ---
>
> Key: SPARK-37059
> URL: https://issues.apache.org/jira/browse/SPARK-37059
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> The collect_set builtin function doesn't ensure the sort order of its result 
> for each row. FPGrouthModel.freqItemsets also doesn' ensure the sort order of 
> the result rows.
> Nevertheless, thier doctests for PySpark assume a certain kind of sort order, 
> causing that such doctests fail with Scala 2.13.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37059) Ensure the sort order of the output in the PySpark doctests

2021-10-19 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-37059:
---
Component/s: Tests

> Ensure the sort order of the output in the PySpark doctests
> ---
>
> Key: SPARK-37059
> URL: https://issues.apache.org/jira/browse/SPARK-37059
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> The collect_set builtin function doesn't ensure the sort order of its result. 
> FPGrouthModel.freqItemsets also doesn' ensure the sort order of the result 
> rows.
> Nevertheless, thier doctests for PySpark assume a certain kind of sort order, 
> causing that such doctests fail with Scala 2.13.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37059) Ensure the sort order of the output in the PySpark doctests

2021-10-19 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-37059:
---
Summary: Ensure the sort order of the output in the PySpark doctests  (was: 
Ensure the sort order of the output in the PySpark examples)

> Ensure the sort order of the output in the PySpark doctests
> ---
>
> Key: SPARK-37059
> URL: https://issues.apache.org/jira/browse/SPARK-37059
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> The collect_set builtin function doesn't ensure the sort order of its result. 
> FPGrouthModel.freqItemsets also doesn' ensure the sort order of the result 
> rows.
> Nevertheless, thier doctests for PySpark assume a certain kind of sort order, 
> causing that such doctests fail with Scala 2.13.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37059) Ensure the sort order of the output in the PySpark examples

2021-10-19 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-37059:
--

 Summary: Ensure the sort order of the output in the PySpark 
examples
 Key: SPARK-37059
 URL: https://issues.apache.org/jira/browse/SPARK-37059
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


The collect_set builtin function doesn't ensure the sort order of its result. 
FPGrouthModel.freqItemsets also doesn' ensure the sort order of the result rows.
Nevertheless, thier doctests for PySpark assume a certain kind of sort order, 
causing that such doctests fail with Scala 2.13.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37058) Add spark-shell command line unit test

2021-10-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37058:


Assignee: (was: Apache Spark)

> Add spark-shell command line unit test
> --
>
> Key: SPARK-37058
> URL: https://issues.apache.org/jira/browse/SPARK-37058
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Add spark-shell command line unit test



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37058) Add spark-shell command line unit test

2021-10-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37058:


Assignee: Apache Spark

> Add spark-shell command line unit test
> --
>
> Key: SPARK-37058
> URL: https://issues.apache.org/jira/browse/SPARK-37058
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> Add spark-shell command line unit test



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37058) Add spark-shell command line unit test

2021-10-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430484#comment-17430484
 ] 

Apache Spark commented on SPARK-37058:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34329

> Add spark-shell command line unit test
> --
>
> Key: SPARK-37058
> URL: https://issues.apache.org/jira/browse/SPARK-37058
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Add spark-shell command line unit test



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37051) The filter operator gets wrong results in ORC's char type

2021-10-19 Thread frankli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

frankli updated SPARK-37051:

Description: 
When I try the following sample SQL on the TPCDS data, the filter operator 
returns an empty row set (shown in the web UI).

_select * from item where i_category = 'Music' limit 100;_

The table is in ORC format, and i_category is of char(50) type.

Data is inserted by Hive and queried by Spark.

I guess that the char(50) type keeps redundant trailing blanks after the actual 
word.

This affects the boolean value of "x.equals(y)" and leads to wrong results.

Luckily, the varchar type is OK.

 

This bug can be reproduced by a few steps.

>>> desc t2_orc;
+-----------+------------+----------+
| col_name  | data_type  | comment  |
+-----------+------------+----------+
| a         | string     | NULL     |
| b         | char(50)   | NULL     |
| c         | int        | NULL     |
+-----------+------------+----------+

>>> select * from t2_orc where a='a';
+----+----+----+
| a  | b  | c  |
+----+----+----+
| a  | b  | 1  |
| a  | b  | 2  |
| a  | b  | 3  |
| a  | b  | 4  |
| a  | b  | 5  |
+----+----+----+

>>> select * from t2_orc where b='b';
+----+----+----+
| a  | b  | c  |
+----+----+----+
+----+----+----+

 

By the way, Spark's tests should add more cases on the char type.

 

== Physical Plan ==
 CollectLimit (3)
 +- Filter (2)
 +- Scan orc tpcds_bin_partitioned_orc_2.item (1)

(1) Scan orc tpcds_bin_partitioned_orc_2.item
 Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, 
i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, 
i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, 
i_container#19, i_manager_id#20, i_product_name#21|#0L, i_item_id#1, 
i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, 
i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, 
i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, 
i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, 
i_product_name#21]
 Batched: false
 Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
 PushedFilters: [IsNotNull(i_category), +EqualTo(i_category,+Music         
)]
 ReadSchema: 
struct

(2) Filter
 Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, 
i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, 
i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, 
i_container#19, i_manager_id#20, i_product_name#21|#0L, i_item_id#1, 
i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, 
i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, 
i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, 
i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, 
i_product_name#21]
 Condition : (isnotnull(i_category#12) AND +(i_category#12 = Music         ))+

(3) CollectLimit
 Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, 
i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, 
i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, 
i_container#19, i_manager_id#20, i_product_name#21|#0L, i_item_id#1, 
i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, 
i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, 
i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, 
i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, 
i_product_name#21]
 Arguments: 100

 

  was:
When I try the following sample SQL on  the TPCDS data, the filter operator 
returns an empty row set (shown in web ui).

_select * from item where i_category = 'Music' limit 100;_

The table is in ORC format, and i_category is char(50) type.

I guest that the char(50) type will remains redundant blanks after the actual 
word.

It will affect the boolean value of  "x.equals(Y)", and results in wrong 
results.

Luckily, the varchar type is OK. 

 

This bug can be reproduced by a few steps.

>>> desc t2_orc;
+---++--+--+
| col_name | data_type | comment |
+---++--+--+
| a | string       | NULL |
| b | char(50)  | NULL |
| c | int            | NULL |
+---++--+–+

>>> select * from t2_orc where a='a';
++++--+
| a | b | c |
++++--+
| a | b | 1 |
| a | b | 2 |
| a | b | 3 |
| a | b | 4 |
| a | b | 5 |
++++–+

>>> select * from t2_orc where b='b';
++++--+
| a | b | c |
++++--+
++++--+

 

By the way, Spark's tests should add more cases on the 

[jira] [Commented] (SPARK-37051) The filter operator gets wrong results in ORC's char type

2021-10-19 Thread frankli (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430450#comment-17430450
 ] 

frankli commented on SPARK-37051:
-

It seems to be affected by the right padding.

[SPARK-34192][SQL] 
[https://github.com/apache/spark/commit/d1177b52304217f4cb86506fd1887ec98879ed16]

[~yaoqiang]
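
For reference, a minimal PySpark repro sketch (assuming a Hive-enabled session; 
the original report creates and populates the table from Hive, so inserting 
from Spark here is a simplification and may not reproduce on every build):

{code:python}
# Minimal repro sketch for the report above. Assumes a Hive-enabled Spark
# session; the original report inserts the data with Hive, so writing it from
# Spark here is a simplification and may not reproduce on every build.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("orc-char-filter-sketch")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE TABLE IF NOT EXISTS t2_orc (a STRING, b CHAR(50), c INT) STORED AS ORC")
spark.sql("INSERT INTO t2_orc VALUES ('a', 'b', 1), ('a', 'b', 2)")

# The STRING filter returns rows, while the CHAR(50) filter comes back empty on
# affected versions because the pushed-down predicate compares the unpadded
# literal 'b' against values stored right-padded to 50 characters.
spark.sql("SELECT * FROM t2_orc WHERE a = 'a'").show()
spark.sql("SELECT * FROM t2_orc WHERE b = 'b'").show()
{code}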

> The filter operator gets wrong results in ORC's char type
> -
>
> Key: SPARK-37051
> URL: https://issues.apache.org/jira/browse/SPARK-37051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1, 3.3.0
> Environment: Spark 3.1.2
> Scala 2.12 / Java 1.8
>Reporter: frankli
>Priority: Critical
>
> When I try the following sample SQL on the TPCDS data, the filter operator 
> returns an empty row set (shown in the web UI).
> _select * from item where i_category = 'Music' limit 100;_
> The table is in ORC format, and i_category is of char(50) type.
> I guess that the char(50) type keeps redundant trailing blanks after the actual 
> word.
> This affects the boolean value of "x.equals(y)" and leads to wrong results.
> Luckily, the varchar type is OK. 
>  
> This bug can be reproduced by a few steps.
> >>> desc t2_orc;
> +-----------+------------+----------+
> | col_name  | data_type  | comment  |
> +-----------+------------+----------+
> | a         | string     | NULL     |
> | b         | char(50)   | NULL     |
> | c         | int        | NULL     |
> +-----------+------------+----------+
> >>> select * from t2_orc where a='a';
> +----+----+----+
> | a  | b  | c  |
> +----+----+----+
> | a  | b  | 1  |
> | a  | b  | 2  |
> | a  | b  | 3  |
> | a  | b  | 4  |
> | a  | b  | 5  |
> +----+----+----+
> >>> select * from t2_orc where b='b';
> +----+----+----+
> | a  | b  | c  |
> +----+----+----+
> +----+----+----+
>  
> By the way, Spark's tests should add more cases on the char type.
>  
> == Physical Plan ==
>  CollectLimit (3)
>  +- Filter (2)
>  +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
> (1) Scan orc tpcds_bin_partitioned_orc_2.item
>  Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Batched: false
>  Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
>  PushedFilters: [IsNotNull(i_category), +EqualTo(i_category,+Music         
> )]
>  ReadSchema: 
> struct
> (2) Filter
>  Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Condition : (isnotnull(i_category#12) AND (i_category#12 = Music          ))
> (3) CollectLimit
>  Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Arguments: 100
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37058) Add spark-shell command line unit test

2021-10-19 Thread angerszhu (Jira)
angerszhu created SPARK-37058:
-

 Summary: Add spark-shell command line unit test
 Key: SPARK-37058
 URL: https://issues.apache.org/jira/browse/SPARK-37058
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: angerszhu


Add spark-shell command line unit test
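
One possible shape for such a test (a sketch only; the class name, statements, 
and assertions here are hypothetical and not taken from the actual patch): 
launch bin/spark-shell as a child process, feed it statements on stdin, and 
assert on its output.

{code:scala}
import java.io.{File, PrintWriter}
import java.nio.charset.StandardCharsets
import scala.io.Source
import org.scalatest.funsuite.AnyFunSuite

class SparkShellCommandLineSuite extends AnyFunSuite {

  test("spark-shell evaluates a statement fed on stdin") {
    val sparkHome = sys.env.getOrElse("SPARK_HOME", fail("SPARK_HOME is not set"))
    val builder = new ProcessBuilder(
      new File(sparkHome, "bin/spark-shell").getAbsolutePath,
      "--master", "local[2]")
    builder.redirectErrorStream(true)
    val process = builder.start()

    // Feed the REPL one statement, then quit.
    val stdin = new PrintWriter(process.getOutputStream, true)
    stdin.println("""println("RESULT=" + spark.range(10).count())""")
    stdin.println(":quit")
    stdin.close()

    // Read everything the shell printed and check for the marker line.
    val output = Source
      .fromInputStream(process.getInputStream, StandardCharsets.UTF_8.name())
      .mkString
    process.waitFor()
    assert(output.contains("RESULT=10"), s"unexpected spark-shell output:\n$output")
  }
}
{code}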




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33564) Prometheus metrics for Master and Worker isn't working

2021-10-19 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-33564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430425#comment-17430425
 ] 

Michał Wieleba commented on SPARK-33564:


Are there any plans to introduce Prometheus metrics for YARN clusters? Any 
update on this?
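
For what it is worth, the PrometheusServlet sink described below is configured 
per application, so driver/executor metrics can be exposed on YARN as well. A 
sketch under that assumption (the config keys are the documented Spark 3.0+ 
metrics settings; the endpoint paths mirror the ones in this report). This only 
covers application-level metrics, not the standalone Master/Worker endpoints.

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: enable the Prometheus servlet sink via SparkConf-style settings
// instead of shipping a metrics.properties file with --files.
val spark = SparkSession.builder()
  .appName("prometheus-sink-sketch")
  // Driver metrics become available under /metrics/prometheus on the driver UI.
  .config("spark.metrics.conf.*.sink.prometheusServlet.class",
    "org.apache.spark.metrics.sink.PrometheusServlet")
  .config("spark.metrics.conf.*.sink.prometheusServlet.path",
    "/metrics/prometheus")
  // Spark 3.0+ (experimental): executor metrics at /metrics/executors/prometheus.
  .config("spark.ui.prometheus.enabled", "true")
  .getOrCreate()
{code}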

> Prometheus metrics for Master and Worker isn't working 
> ---
>
> Key: SPARK-33564
> URL: https://issues.apache.org/jira/browse/SPARK-33564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Paulo Roberto de Oliveira Castro
>Priority: Major
>  Labels: Metrics, metrics, prometheus
>
> Following the [PR|https://github.com/apache/spark/pull/25769] that introduced 
> the Prometheus sink, I downloaded the {{spark-3.0.1-bin-hadoop2.7.tgz}}  
> (also tested with 3.0.0), uncompressed the tgz and created a file called 
> {{metrics.properties}} adding this content:
> {quote}{{*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet}}
>  {{*.sink.prometheusServlet.path=/metrics/prometheus}}
>  master.sink.prometheusServlet.path=/metrics/master/prometheus
>  applications.sink.prometheusServlet.path=/metrics/applications/prometheus
> {quote}
> Then I ran: 
> {quote}{{$ sbin/start-master.sh}}
>  {{$ sbin/start-slave.sh spark://`hostname`:7077}}
>  {{$ bin/spark-shell --master spark://`hostname`:7077 
> --files=./metrics.properties --conf spark.metrics.conf=./metrics.properties}}
> {quote}
> {{The Spark shell opens without problems:}}
> {quote}{{20/11/25 17:36:07 WARN NativeCodeLoader: Unable to load 
> native-hadoop library for your platform... using builtin-java classes where 
> applicable}}
> {{Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties}}
> {{Setting default log level to "WARN".}}
> {{To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).}}
> {{Spark context Web UI available at 
> [http://192.168.0.6:4040|http://192.168.0.6:4040/]}}
> {{Spark context available as 'sc' (master = 
> spark://MacBook-Pro-de-Paulo-2.local:7077, app id = 
> app-20201125173618-0002).}}
> {{Spark session available as 'spark'.}}
> {{Welcome to}}
> {{      ____              __}}
> {{     / __/__  ___ _____/ /__}}
> {{    _\ \/ _ \/ _ `/ __/  '_/}}
> {{   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0}}
> {{      /_/}}
> {{         }}
> {{Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)}}
> {{Type in expressions to have them evaluated.}}
> {{Type :help for more information. }}
> {{scala>}}
> {quote}
> {{And when I try to fetch prometheus metrics for driver, everything works 
> fine:}}
> {quote}$ curl -s [http://localhost:4040/metrics/prometheus/] | head -n 5
> metrics_app_20201125173618_0002_driver_BlockManager_disk_diskSpaceUsed_MB_Number\{type="gauges"}
>  0
> metrics_app_20201125173618_0002_driver_BlockManager_disk_diskSpaceUsed_MB_Value\{type="gauges"}
>  0
> metrics_app_20201125173618_0002_driver_BlockManager_memory_maxMem_MB_Number\{type="gauges"}
>  732
> metrics_app_20201125173618_0002_driver_BlockManager_memory_maxMem_MB_Value\{type="gauges"}
>  732
> metrics_app_20201125173618_0002_driver_BlockManager_memory_maxOffHeapMem_MB_Number\{type="gauges"}
>  0
> {quote}
> *The problem appears when I try accessing master metrics*, and I get the 
> following problem:
> {quote}{{$ curl -s [http://localhost:8080/metrics/master/prometheus]}}
> {{[HTML response omitted: it is the Spark Master web UI page ("Spark Master 
> at spark://MacBook-Pro-de-Paulo-2.local:7077", version 3.0.0) with its 
> stylesheets, scripts, and status fields such as URL: 
> spark://MacBook-Pro-de-Paulo-2.local:7077, rather than Prometheus metrics.]}}
>  ...
> {quote}
> Instead of the metrics I'm getting an HTML page.  The same happens for all of 
> those here:
> {quote}{{$ curl -s [http://localhost:8080/metrics/applications/prometheus/]}}
>  {{$ curl -s 

[jira] [Resolved] (SPARK-37057) Fix wrong DocSearch facet filter in release-tag.sh

2021-10-19 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-37057.

Fix Version/s: 3.2.1
   3.3.0
   Resolution: Fixed

Issue resolved by pull request 34328
[https://github.com/apache/spark/pull/34328]

> Fix wrong DocSearch facet filter in release-tag.sh
> --
>
> Key: SPARK-37057
> URL: https://issues.apache.org/jira/browse/SPARK-37057
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.3.0, 3.2.1
>
>
> In release-tag.sh, the DocSearch facet filter should be updated to the 
> release version before creating the git tag. 
> If we miss this step, the facet filter is wrong in the new release doc:  
> https://github.com/apache/spark/blame/v3.2.0/docs/_config.yml#L42



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37051) The filter operator gets wrong results in ORC's char type

2021-10-19 Thread frankli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

frankli updated SPARK-37051:

Affects Version/s: 3.3.0

> The filter operator gets wrong results in ORC's char type
> -
>
> Key: SPARK-37051
> URL: https://issues.apache.org/jira/browse/SPARK-37051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1, 3.3.0
> Environment: Spark 3.1.2
> Scala 2.12 / Java 1.8
>Reporter: frankli
>Priority: Critical
>
> When I try the following sample SQL on the TPC-DS data, the filter operator 
> returns an empty row set (shown in the web UI).
> _select * from item where i_category = 'Music' limit 100;_
> The table is in ORC format, and i_category is of char(50) type.
> I guess that the char(50) type retains redundant trailing blanks after the 
> actual word.
> They affect the result of "x.equals(y)" and lead to wrong filter results.
> Luckily, the varchar type is OK. 
>  
> This bug can be reproduced by a few steps.
> >>> desc t2_orc;
> +-----------+------------+----------+
> | col_name  | data_type  | comment  |
> +-----------+------------+----------+
> | a         | string     | NULL     |
> | b         | char(50)   | NULL     |
> | c         | int        | NULL     |
> +-----------+------------+----------+
> >>> select * from t2_orc where a='a';
> +----+----+----+
> | a  | b  | c  |
> +----+----+----+
> | a  | b  | 1  |
> | a  | b  | 2  |
> | a  | b  | 3  |
> | a  | b  | 4  |
> | a  | b  | 5  |
> +----+----+----+
> >>> select * from t2_orc where b='b';
> +----+----+----+
> | a  | b  | c  |
> +----+----+----+
> +----+----+----+
>  
> By the way, Spark's tests should add more cases on the char type.
>  
> == Physical Plan ==
>  CollectLimit (3)
>  +- Filter (2)
>     +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
> (1) Scan orc tpcds_bin_partitioned_orc_2.item
>  Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Batched: false
>  Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
>  PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music          )]
>  ReadSchema: 
> struct
> (2) Filter
>  Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Condition : (isnotnull(i_category#12) AND (i_category#12 = Music          ))
> (3) CollectLimit
>  Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Arguments: 100
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36348) unexpected Index loaded: pd.Index([10, 20, None], name="x")

2021-10-19 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430392#comment-17430392
 ] 

Yikun Jiang commented on SPARK-36348:
-

Revisiting this: it has already been fixed in the current master branch. I will 
add a simple PR to optimize the test case.

{code:python}
pidx = pd.Index([10, 20, 15, 30, 45, None], name="x")
psidx = ps.Index(pidx)
self.assert_eq(psidx.astype(bool), pidx.astype(bool))
self.assert_eq(psidx.astype(str), pidx.astype(str))
{code}




> unexpected Index loaded: pd.Index([10, 20, None], name="x")
> ---
>
> Key: SPARK-36348
> URL: https://issues.apache.org/jira/browse/SPARK-36348
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Priority: Major
>
> {code:python}
> pidx = pd.Index([10, 20, 15, 30, 45, None], name="x")
> psidx = ps.Index(pidx)
> self.assert_eq(psidx.astype(str), pidx.astype(str))
> {code}
> [left pandas on spark]:  Index(['10.0', '20.0', '15.0', '30.0', '45.0', 
> 'nan'], dtype='object', name='x')
> [right pandas]: Index(['10', '20', '15', '30', '45', 'None'], dtype='object', 
> name='x')
> The index is loaded as float64, so a follow-up step like astype produces 
> results that differ from pandas.
> [1] 
> https://github.com/apache/spark/blob/bcc595c112a23d8e3024ace50f0dbc7eab7144b2/python/pyspark/pandas/tests/indexes/test_base.py#L2249



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37051) The filter operator gets wrong results in ORC's char type

2021-10-19 Thread frankli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

frankli updated SPARK-37051:

Affects Version/s: 3.2.1

> The filter operator gets wrong results in ORC's char type
> -
>
> Key: SPARK-37051
> URL: https://issues.apache.org/jira/browse/SPARK-37051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1
> Environment: Spark 3.1.2
> Scala 2.12 / Java 1.8
>Reporter: frankli
>Priority: Critical
>
> When I try the following sample SQL on the TPC-DS data, the filter operator 
> returns an empty row set (shown in the web UI).
> _select * from item where i_category = 'Music' limit 100;_
> The table is in ORC format, and i_category is of char(50) type.
> I guess that the char(50) type retains redundant trailing blanks after the 
> actual word.
> They affect the result of "x.equals(y)" and lead to wrong filter results.
> Luckily, the varchar type is OK. 
>  
> This bug can be reproduced by a few steps.
> >>> desc t2_orc;
> +-----------+------------+----------+
> | col_name  | data_type  | comment  |
> +-----------+------------+----------+
> | a         | string     | NULL     |
> | b         | char(50)   | NULL     |
> | c         | int        | NULL     |
> +-----------+------------+----------+
> >>> select * from t2_orc where a='a';
> +----+----+----+
> | a  | b  | c  |
> +----+----+----+
> | a  | b  | 1  |
> | a  | b  | 2  |
> | a  | b  | 3  |
> | a  | b  | 4  |
> | a  | b  | 5  |
> +----+----+----+
> >>> select * from t2_orc where b='b';
> +----+----+----+
> | a  | b  | c  |
> +----+----+----+
> +----+----+----+
>  
> By the way, Spark's tests should add more cases on the char type.
>  
> == Physical Plan ==
>  CollectLimit (3)
>  +- Filter (2)
>     +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
> (1) Scan orc tpcds_bin_partitioned_orc_2.item
>  Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Batched: false
>  Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
>  PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music          )]
>  ReadSchema: 
> struct
> (2) Filter
>  Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Condition : (isnotnull(i_category#12) AND (i_category#12 = Music          ))
> (3) CollectLimit
>  Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, 
> i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, 
> i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
> i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, 
> i_color#17, i_units#18, i_container#19, i_manager_id#20, 
> i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, 
> i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, 
> i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, 
> i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, 
> i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
>  Arguments: 100
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37057) Fix wrong DocSearch facet filter in release-tag.sh

2021-10-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37057:


Assignee: Gengliang Wang  (was: Apache Spark)

> Fix wrong DocSearch facet filter in release-tag.sh
> --
>
> Key: SPARK-37057
> URL: https://issues.apache.org/jira/browse/SPARK-37057
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
>
> In release-tag.sh, the DocSearch facet filter should be updated to the 
> release version before creating the git tag. 
> If we miss this step, the facet filter is wrong in the new release doc:  
> https://github.com/apache/spark/blame/v3.2.0/docs/_config.yml#L42



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37057) Fix wrong DocSearch facet filter in release-tag.sh

2021-10-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430368#comment-17430368
 ] 

Apache Spark commented on SPARK-37057:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/34328

> Fix wrong DocSearch facet filter in release-tag.sh
> --
>
> Key: SPARK-37057
> URL: https://issues.apache.org/jira/browse/SPARK-37057
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
>
> In release-tag.sh, the DocSearch facet filter should be updated to the 
> release version before creating the git tag. 
> If we miss this step, the facet filter is wrong in the new release doc:  
> https://github.com/apache/spark/blame/v3.2.0/docs/_config.yml#L42



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


