[jira] [Resolved] (SPARK-37049) executorIdleTimeout is not working for pending pods on K8s
[ https://issues.apache.org/jira/browse/SPARK-37049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-37049.
-----------------------------------
Fix Version/s: 3.1.3, 3.2.1, 3.3.0
Resolution: Fixed

Issue resolved by pull request 34319
[https://github.com/apache/spark/pull/34319]

> executorIdleTimeout is not working for pending pods on K8s
> ----------------------------------------------------------
>
> Key: SPARK-37049
> URL: https://issues.apache.org/jira/browse/SPARK-37049
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core
> Affects Versions: 3.1.0
> Reporter: Weiwei Yang
> Assignee: wwei
> Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> SPARK-33099 added support for respecting "spark.dynamicAllocation.executorIdleTimeout" in ExecutorPodsAllocator. However, when it checks whether a pending executor pod has timed out, it checks against the pod's "startTime". A pending pod's "startTime" is empty, which causes "isExecutorIdleTimedOut()" to always return true for pending pods.
> As a result, pending pods are deleted immediately when a stage finishes, and several new pods are recreated in the next stage.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
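The fix can be illustrated with a small sketch (field names and the dict shape here are hypothetical stand-ins; the actual change is in Spark's Scala ExecutorPodsAllocator): when a pod has no startTime yet, fall back to its creation timestamp instead of treating the missing value as time zero, which made every pending pod look idle.

```python
# Sketch of the corrected idle check. Times are epoch seconds; the pod dict
# is a simplified stand-in for the Kubernetes pod status fields.
def is_executor_idle_timed_out(pod, now, idle_timeout_s):
    # A pending pod has no startTime yet; use creationTimestamp as the
    # reference point rather than treating "empty" as already timed out.
    start = pod.get("startTime") or pod["creationTimestamp"]
    return now - start >= idle_timeout_s

pending = {"startTime": None, "creationTimestamp": 100}
print(is_executor_idle_timed_out(pending, 130, 60))  # False: pending for only 30s
print(is_executor_idle_timed_out(pending, 200, 60))  # True: pending for 100s
```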
[jira] [Assigned] (SPARK-37049) executorIdleTimeout is not working for pending pods on K8s
[ https://issues.apache.org/jira/browse/SPARK-37049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-37049:
-------------------------------------
Assignee: wwei

> executorIdleTimeout is not working for pending pods on K8s
> ----------------------------------------------------------
>
> Key: SPARK-37049
> URL: https://issues.apache.org/jira/browse/SPARK-37049
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core
> Affects Versions: 3.1.0
> Reporter: Weiwei Yang
> Assignee: wwei
> Priority: Major
>
> SPARK-33099 added support for respecting "spark.dynamicAllocation.executorIdleTimeout" in ExecutorPodsAllocator. However, when it checks whether a pending executor pod has timed out, it checks against the pod's "startTime". A pending pod's "startTime" is empty, which causes "isExecutorIdleTimedOut()" to always return true for pending pods.
> As a result, pending pods are deleted immediately when a stage finishes, and several new pods are recreated in the next stage.
[jira] [Commented] (SPARK-27696) kubernetes driver pod not deleted after finish.
[ https://issues.apache.org/jira/browse/SPARK-27696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430971#comment-17430971 ]

Jin Xing commented on SPARK-27696:
----------------------------------
We also suffer from this issue. On our side, we hope Spark can support a way for the driver pod to clean itself up. [~Andrew HUALI] Would you please share the PR, or give some insight into spark.kubernetes.submission.deleteCompletedPod?

> kubernetes driver pod not deleted after finish.
> -----------------------------------------------
>
> Key: SPARK-27696
> URL: https://issues.apache.org/jira/browse/SPARK-27696
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, Spark Core
> Affects Versions: 2.4.0
> Reporter: Henry Yu
> Priority: Minor
>
> When submitting to K8s, the driver pod is not deleted after job completion. Because K8s checks that the driver pod name does not already exist, this is especially painful when we use a workflow tool to resubmit a failed Spark job. (By the way, the client always exiting with 0 is another painful issue.)
> I have fixed this with a new config, spark.kubernetes.submission.deleteCompletedPod=true, in our home-maintained Spark version.
> Do you have more insights, or should I make a PR for this issue?
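A self-cleanup pass like the one requested would first have to select the finished driver pods before resubmission. A minimal sketch under assumed data shapes (the `spark-role=driver` label is what Spark on K8s sets on driver pods, but the pod dicts below are illustrative, not a real Kubernetes client API):

```python
# Pick driver pods that have finished, so they can be deleted before a job
# with the same driver pod name is resubmitted. The pod dicts are a
# simplified stand-in for what a Kubernetes client would return.
def completed_driver_pods(pods):
    return [
        p["name"]
        for p in pods
        if p.get("labels", {}).get("spark-role") == "driver"
        and p["phase"] in ("Succeeded", "Failed")
    ]

pods = [
    {"name": "job-a-driver", "labels": {"spark-role": "driver"}, "phase": "Succeeded"},
    {"name": "job-b-driver", "labels": {"spark-role": "driver"}, "phase": "Running"},
    {"name": "job-a-exec-1", "labels": {"spark-role": "executor"}, "phase": "Succeeded"},
]
print(completed_driver_pods(pods))  # ['job-a-driver']
```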
[jira] [Resolved] (SPARK-36348) unexpected Index loaded: pd.Index([10, 20, None], name="x")
[ https://issues.apache.org/jira/browse/SPARK-36348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-36348.
----------------------------------
Fix Version/s: 3.3.0
Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/34335

> unexpected Index loaded: pd.Index([10, 20, None], name="x")
> -----------------------------------------------------------
>
> Key: SPARK-36348
> URL: https://issues.apache.org/jira/browse/SPARK-36348
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.2.0
> Reporter: Yikun Jiang
> Assignee: Yikun Jiang
> Priority: Major
> Fix For: 3.3.0
>
> {code:python}
> pidx = pd.Index([10, 20, 15, 30, 45, None], name="x")
> psidx = ps.Index(pidx)
> self.assert_eq(psidx.astype(str), pidx.astype(str))
> {code}
> [left, pandas-on-Spark]: Index(['10.0', '20.0', '15.0', '30.0', '45.0', 'nan'], dtype='object', name='x')
> [right, pandas]: Index(['10', '20', '15', '30', '45', 'None'], dtype='object', name='x')
> The index is loaded as float64, so subsequent steps such as astype differ from pandas.
> [1] https://github.com/apache/spark/blob/bcc595c112a23d8e3024ace50f0dbc7eab7144b2/python/pyspark/pandas/tests/indexes/test_base.py#L2249
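The root cause can be reproduced without Spark at all: once `None` forces the integer values into floats, `str()` renders them with a trailing `.0`. A minimal plain-Python sketch mirroring the dtype difference:

```python
vals = [10, 20, 15, 30, 45, None]

# What happens when the index is loaded as float64: None becomes NaN and the
# integers become floats, so their string form gains a ".0" suffix.
as_float = [float("nan") if v is None else float(v) for v in vals]
print([str(v) for v in as_float])  # ['10.0', '20.0', '15.0', '30.0', '45.0', 'nan']

# What pandas does with an object-dtype index: integers stay integers.
print([str(v) for v in vals])      # ['10', '20', '15', '30', '45', 'None']
```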
[jira] [Updated] (SPARK-37068) Confusing tgz filename for download
[ https://issues.apache.org/jira/browse/SPARK-37068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Yu updated SPARK-37068:
-----------------------------
Attachment: spark-download-issue.png

> Confusing tgz filename for download
> -----------------------------------
>
> Key: SPARK-37068
> URL: https://issues.apache.org/jira/browse/SPARK-37068
> Project: Spark
> Issue Type: Bug
> Components: Build, Documentation
> Affects Versions: 3.2.0
> Reporter: James Yu
> Priority: Minor
> Attachments: spark-download-issue.png
>
> On the Spark download webpage [https://spark.apache.org/downloads.html], the package type dropdown says "Hadoop 3.3", but the Download Spark tgz filename contains "hadoop3.2". It is confusing; which version is correct?
>
> Download Apache Spark(TM)
> # Choose a Spark release: 3.2.0 (Oct 13 2021)
> # Choose a package type: Pre-built for Apache Hadoop 3.3 and later
> # Download Spark: spark-3.2.0-bin-hadoop3.2.tgz
[jira] [Created] (SPARK-37068) Confusing tgz filename for download
James Yu created SPARK-37068:
-----------------------------

Summary: Confusing tgz filename for download
Key: SPARK-37068
URL: https://issues.apache.org/jira/browse/SPARK-37068
Project: Spark
Issue Type: Bug
Components: Build, Documentation
Affects Versions: 3.2.0
Reporter: James Yu

On the Spark download webpage [https://spark.apache.org/downloads.html], the package type dropdown says "Hadoop 3.3", but the Download Spark tgz filename contains "hadoop3.2". It is confusing; which version is correct?

Download Apache Spark(TM)
# Choose a Spark release: 3.2.0 (Oct 13 2021)
# Choose a package type: Pre-built for Apache Hadoop 3.3 and later
# Download Spark: spark-3.2.0-bin-hadoop3.2.tgz
[jira] [Commented] (SPARK-37067) DateTimeUtils.stringToTimestamp() incorrectly rejects timezone without colon
[ https://issues.apache.org/jira/browse/SPARK-37067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430872#comment-17430872 ]

Apache Spark commented on SPARK-37067:
--------------------------------------
User 'linhongliu-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/34338

> DateTimeUtils.stringToTimestamp() incorrectly rejects timezone without colon
> ----------------------------------------------------------------------------
>
> Key: SPARK-37067
> URL: https://issues.apache.org/jira/browse/SPARK-37067
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 3.1.0, 3.2.0
> Reporter: Linhong Liu
> Priority: Major
>
> A zone id with a format like "+" or "+0730" can be parsed by `ZoneId.of()` but is rejected by Spark's `DateTimeUtils.stringToTimestamp()`. This means we return null for some valid datetime strings, such as `2021-10-11T03:58:03.000+0700`.
[jira] [Assigned] (SPARK-37067) DateTimeUtils.stringToTimestamp() incorrectly rejects timezone without colon
[ https://issues.apache.org/jira/browse/SPARK-37067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37067:
------------------------------------
Assignee: (was: Apache Spark)

> DateTimeUtils.stringToTimestamp() incorrectly rejects timezone without colon
> ----------------------------------------------------------------------------
>
> Key: SPARK-37067
> URL: https://issues.apache.org/jira/browse/SPARK-37067
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 3.1.0, 3.2.0
> Reporter: Linhong Liu
> Priority: Major
>
> A zone id with a format like "+" or "+0730" can be parsed by `ZoneId.of()` but is rejected by Spark's `DateTimeUtils.stringToTimestamp()`. This means we return null for some valid datetime strings, such as `2021-10-11T03:58:03.000+0700`.
[jira] [Assigned] (SPARK-37067) DateTimeUtils.stringToTimestamp() incorrectly rejects timezone without colon
[ https://issues.apache.org/jira/browse/SPARK-37067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37067:
------------------------------------
Assignee: Apache Spark

> DateTimeUtils.stringToTimestamp() incorrectly rejects timezone without colon
> ----------------------------------------------------------------------------
>
> Key: SPARK-37067
> URL: https://issues.apache.org/jira/browse/SPARK-37067
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 3.1.0, 3.2.0
> Reporter: Linhong Liu
> Assignee: Apache Spark
> Priority: Major
>
> A zone id with a format like "+" or "+0730" can be parsed by `ZoneId.of()` but is rejected by Spark's `DateTimeUtils.stringToTimestamp()`. This means we return null for some valid datetime strings, such as `2021-10-11T03:58:03.000+0700`.
[jira] [Created] (SPARK-37067) DateTimeUtils.stringToTimestamp() incorrectly rejects timezone without colon
Linhong Liu created SPARK-37067:
-------------------------------

Summary: DateTimeUtils.stringToTimestamp() incorrectly rejects timezone without colon
Key: SPARK-37067
URL: https://issues.apache.org/jira/browse/SPARK-37067
Project: Spark
Issue Type: Task
Components: SQL
Affects Versions: 3.2.0, 3.1.0
Reporter: Linhong Liu

A zone id with a format like "+" or "+0730" can be parsed by `ZoneId.of()` but is rejected by Spark's `DateTimeUtils.stringToTimestamp()`. This means we return null for some valid datetime strings, such as `2021-10-11T03:58:03.000+0700`.
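The colon-less offset is the valid ISO 8601 "basic" form, and other parsers accept it. As an analogy to `ZoneId.of()` (not Spark's own code path), Python's standard library parses such a string directly:

```python
from datetime import datetime, timedelta

# "+0700" (basic form) and "+07:00" (extended form) denote the same offset;
# strptime's %z accepts either (the colon variant since Python 3.7).
ts = "2021-10-11T03:58:03.000+0700"
dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f%z")
print(dt.utcoffset() == timedelta(hours=7))  # True
```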
[jira] [Assigned] (SPARK-36348) unexpected Index loaded: pd.Index([10, 20, None], name="x")
[ https://issues.apache.org/jira/browse/SPARK-36348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-36348:
------------------------------------
Assignee: Yikun Jiang

> unexpected Index loaded: pd.Index([10, 20, None], name="x")
> -----------------------------------------------------------
>
> Key: SPARK-36348
> URL: https://issues.apache.org/jira/browse/SPARK-36348
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.2.0
> Reporter: Yikun Jiang
> Assignee: Yikun Jiang
> Priority: Major
>
> {code:python}
> pidx = pd.Index([10, 20, 15, 30, 45, None], name="x")
> psidx = ps.Index(pidx)
> self.assert_eq(psidx.astype(str), pidx.astype(str))
> {code}
> [left, pandas-on-Spark]: Index(['10.0', '20.0', '15.0', '30.0', '45.0', 'nan'], dtype='object', name='x')
> [right, pandas]: Index(['10', '20', '15', '30', '45', 'None'], dtype='object', name='x')
> The index is loaded as float64, so subsequent steps such as astype differ from pandas.
> [1] https://github.com/apache/spark/blob/bcc595c112a23d8e3024ace50f0dbc7eab7144b2/python/pyspark/pandas/tests/indexes/test_base.py#L2249
[jira] [Commented] (SPARK-37066) Improve ORC RecordReader's error message
[ https://issues.apache.org/jira/browse/SPARK-37066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430858#comment-17430858 ]

Apache Spark commented on SPARK-37066:
--------------------------------------
User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34337

> Improve ORC RecordReader's error message
> ----------------------------------------
>
> Key: SPARK-37066
> URL: https://issues.apache.org/jira/browse/SPARK-37066
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: angerszhu
> Priority: Major
>
> In the error message below, we don't know the actual file path:
> {code}
> 21/10/19 18:12:58 ERROR Executor: Exception in task 34.1 in stage 14.0 (TID 257)
> java.lang.ArrayIndexOutOfBoundsException: 1024
> 	at org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:292)
> 	at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.nextVector(TreeReaderFactory.java:1820)
> 	at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1517)
> 	at org.apache.orc.impl.ConvertTreeReaderFactory$DateFromStringGroupTreeReader.nextVector(ConvertTreeReaderFactory.java:1802)
> 	at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:2059)
> 	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1324)
> 	at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:196)
> 	at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:99)
> 	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
> 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
> 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
> 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
> 	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
> 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
> 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
> 	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> 	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
> 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
> 	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
> 	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
> 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:131)
> 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
> 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> {code}
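The requested improvement, attaching the file path to reader failures, can be sketched generically. All names and the path below are hypothetical illustrations; the actual change lands in Spark's ORC reader, not in Python:

```python
def read_with_context(path, reader):
    """Re-raise any failure from the underlying reader with the file path attached."""
    try:
        yield from reader(path)
    except Exception as exc:
        raise RuntimeError(f"Encountered error while reading file: {path}") from exc

def bad_reader(path):
    # Stand-in for a reader that hits an ArrayIndexOutOfBoundsException.
    raise IndexError("1024")
    yield  # unreachable; makes this function a generator

try:
    list(read_with_context("/warehouse/part-00034.orc", bad_reader))
except RuntimeError as exc:
    print(exc)  # Encountered error while reading file: /warehouse/part-00034.orc
```

The original exception stays available as `__cause__`, so the full stack trace is not lost, only prefixed with the context the operator needs.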
[jira] [Assigned] (SPARK-37066) Improve ORC RecordReader's error message
[ https://issues.apache.org/jira/browse/SPARK-37066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37066:
------------------------------------
Assignee: Apache Spark

> Improve ORC RecordReader's error message
> ----------------------------------------
>
> Key: SPARK-37066
> URL: https://issues.apache.org/jira/browse/SPARK-37066
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: angerszhu
> Assignee: Apache Spark
> Priority: Major
>
> In the error message below, we don't know the actual file path:
> {code}
> 21/10/19 18:12:58 ERROR Executor: Exception in task 34.1 in stage 14.0 (TID 257)
> java.lang.ArrayIndexOutOfBoundsException: 1024
> (stack trace identical to the one quoted earlier in this digest; omitted)
> {code}
[jira] [Assigned] (SPARK-37066) Improve ORC RecordReader's error message
[ https://issues.apache.org/jira/browse/SPARK-37066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37066:
------------------------------------
Assignee: (was: Apache Spark)

> Improve ORC RecordReader's error message
> ----------------------------------------
>
> Key: SPARK-37066
> URL: https://issues.apache.org/jira/browse/SPARK-37066
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: angerszhu
> Priority: Major
>
> In the error message below, we don't know the actual file path:
> {code}
> 21/10/19 18:12:58 ERROR Executor: Exception in task 34.1 in stage 14.0 (TID 257)
> java.lang.ArrayIndexOutOfBoundsException: 1024
> (stack trace identical to the one quoted earlier in this digest; omitted)
> {code}
[jira] [Created] (SPARK-37066) Improve ORC RecordReader's error message
angerszhu created SPARK-37066:
------------------------------

Summary: Improve ORC RecordReader's error message
Key: SPARK-37066
URL: https://issues.apache.org/jira/browse/SPARK-37066
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu

In the error message below, we don't know the actual file path:

{code}
21/10/19 18:12:58 ERROR Executor: Exception in task 34.1 in stage 14.0 (TID 257)
java.lang.ArrayIndexOutOfBoundsException: 1024
(stack trace identical to the one quoted earlier in this digest; omitted)
{code}
[jira] [Updated] (SPARK-37035) Improve error message when use vectorize reader
[ https://issues.apache.org/jira/browse/SPARK-37035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-37035: -- Parent: SPARK-37065 Issue Type: Sub-task (was: Improvement) > Improve error message when use vectorize reader > --- > > Key: SPARK-37035 > URL: https://issues.apache.org/jira/browse/SPARK-37035 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: angerszhu >Priority: Major > > The vectorized reader doesn't show which file failed to read. > > Non-vectorized parquet reader > {code} > cutionException: Encounter error while reading parquet files. One possible > cause: Parquet column cannot be converted in the corresponding files. Details: > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:193) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) > at org.apache.spark.scheduler.Task.run(Task.scala:123) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) 
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value > at 1 in block 0 in file hdfs://path/to/failed/file > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251) > at > org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181) > ... 15 more > {code} > Vectorized parquet reader > {code} > 21/10/15 18:01:54 WARN TaskSetManager: Lost task 1881.0 in stage 16.0 (TID > 10380, ip-10-130-169-140.idata-server.shopee.io, executor 168): TaskKilled > (Stage cancelled) > : An error occurred while calling o362.showString. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 963 > in stage 17.0 failed 4 times, most recent failure: Lost task 963.3 in stage > 17.0 (TID 10351, ip-10-130-75-201.idata-server.shopee.io, executor 99): > java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary > at org.apache.parquet.column.Dictionary.decodeToLong(Dictionary.java:49) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToLong(ParquetDictionary.java:36) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:364) > at > org.apache.spark.sql.execution.vectorized.MutableColumnarRow.getLong(MutableColumnarRow.java:120) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.FileSourceScanExec$$anonfun$doExecute$2$$anonfun$apply$2.apply(DataSourceScanExec.scala:351) > at > org.apache.spark.sql.execution.FileSourceScanExec$$anonfun$doExecute$2$$anonfun$apply$2.apply(DataSourceScanExec.scala:349) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) > at >
[jira] [Created] (SPARK-37065) improve File reader's error message for each data source
angerszhu created SPARK-37065: - Summary: improve File reader's error message for each data source Key: SPARK-37065 URL: https://issues.apache.org/jira/browse/SPARK-37065 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: angerszhu Improve the file reader's error message for each data source. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37044) Add Row to __all__ in pyspark.sql.types
[ https://issues.apache.org/jira/browse/SPARK-37044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37044. -- Fix Version/s: 3.3.0 Assignee: Maciej Szymkiewicz Resolution: Fixed Fixed in https://github.com/apache/spark/pull/34332 > Add Row to __all__ in pyspark.sql.types > --- > > Key: SPARK-37044 > URL: https://issues.apache.org/jira/browse/SPARK-37044 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.1.0, 3.2.0, 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Minor > Fix For: 3.3.0 > > > Currently {{Row}}, defined in {{pyspark.sql.types}}, is exported from > {{pyspark.sql}} but not from {{pyspark.sql.types}} itself. This means that > {{from pyspark.sql.types import *}} won't import {{Row}}. > It might be counter-intuitive, especially when we import {{Row}} from > {{types}} in {{examples}}. > Should we add it to {{__all__}}? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
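The behavior behind SPARK-37044 is standard Python: a wildcard import only picks up names listed in the module's `__all__`. A self-contained demonstration with a toy module (`toymod` is illustrative, not pyspark):

```python
# Why Row is invisible to "from pyspark.sql.types import *":
# a star import honors the module's __all__, and Row isn't listed there.
import sys
import types

toymod = types.ModuleType("toymod")
toymod.Row = type("Row", (), {})            # defined in the module...
toymod.StructType = type("StructType", (), {})
toymod.__all__ = ["StructType"]             # ...but not exported
sys.modules["toymod"] = toymod

ns = {}
exec("from toymod import *", ns)
print("StructType" in ns)  # True
print("Row" in ns)         # False until Row is added to __all__
```

Adding `"Row"` to the `__all__` list (what the PR does in `pyspark/sql/types.py`) makes the star import pick it up.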
[jira] [Assigned] (SPARK-36763) Pull out ordering expressions
[ https://issues.apache.org/jira/browse/SPARK-36763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36763: Assignee: Apache Spark > Pull out ordering expressions > - > > Key: SPARK-36763 > URL: https://issues.apache.org/jira/browse/SPARK-36763 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > Similar to > [PullOutGroupingExpressions|https://github.com/apache/spark/blob/7fd3f8f9ec55b364525407213ba1c631705686c5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PullOutGroupingExpressions.scala#L48]. > We can pull out ordering expressions to improve order performance. For > example: > {code:scala} > sql("create table t1(a int, b int) using parquet") > sql("insert into t1 values (1, 2)") > sql("insert into t1 values (3, 4)") > sql("select * from t1 order by a - b").explain > {code} > {noformat} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- Sort [(a#12 - b#13) ASC NULLS FIRST], true, 0 >+- Exchange rangepartitioning((a#12 - b#13) ASC NULLS FIRST, 5), > ENSURE_REQUIREMENTS, [id=#39] > +- FileScan parquet default.t1[a#12,b#13] > {noformat} > The {{Subtract}} will be evaluated 4 times. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
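The optimization proposed in SPARK-36763 is to project the ordering expression out once per row instead of re-evaluating it at every step (range partitioning, local sort, etc.). A pure-Python analogue that counts evaluations (this is a model of the idea, not the Catalyst rule itself):

```python
# Model of "pull out ordering expressions": evaluate (a - b) once per row,
# then sort on the precomputed key, rather than re-computing it per phase.

eval_count = 0

def subtract(a, b):
    """Stand-in for the Subtract expression; counts evaluations."""
    global eval_count
    eval_count += 1
    return a - b

rows = [(1, 2), (3, 4), (5, 1), (2, 2)]

# Pulled-out form: one evaluation per row, then sort on the stored key.
projected = [(subtract(a, b), (a, b)) for a, b in rows]
projected.sort(key=lambda t: t[0])
sorted_rows = [r for _, r in projected]

print(sorted_rows)  # [(1, 2), (3, 4), (2, 2), (5, 1)]
print(eval_count)   # 4: exactly one evaluation per row
```

In the Jira's distributed example, the same `Subtract` would otherwise be evaluated several times per row across partitioning and sorting; the pulled-out column caps it at one.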
[jira] [Assigned] (SPARK-36763) Pull out ordering expressions
[ https://issues.apache.org/jira/browse/SPARK-36763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36763: Assignee: (was: Apache Spark) > Pull out ordering expressions > - > > Key: SPARK-36763 > URL: https://issues.apache.org/jira/browse/SPARK-36763 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > > Similar to > [PullOutGroupingExpressions|https://github.com/apache/spark/blob/7fd3f8f9ec55b364525407213ba1c631705686c5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PullOutGroupingExpressions.scala#L48]. > We can pull out ordering expressions to improve order performance. For > example: > {code:scala} > sql("create table t1(a int, b int) using parquet") > sql("insert into t1 values (1, 2)") > sql("insert into t1 values (3, 4)") > sql("select * from t1 order by a - b").explain > {code} > {noformat} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- Sort [(a#12 - b#13) ASC NULLS FIRST], true, 0 >+- Exchange rangepartitioning((a#12 - b#13) ASC NULLS FIRST, 5), > ENSURE_REQUIREMENTS, [id=#39] > +- FileScan parquet default.t1[a#12,b#13] > {noformat} > The {{Subtract}} will be evaluated 4 times. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37064) Fix outer join return the wrong max rows if other side is empty
[ https://issues.apache.org/jira/browse/SPARK-37064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37064: Assignee: Apache Spark > Fix outer join return the wrong max rows if other side is empty > --- > > Key: SPARK-37064 > URL: https://issues.apache.org/jira/browse/SPARK-37064 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Major > > An outer join should return at least the number of rows of its outer side, > i.e. a left outer join at least its left side's rows, a right outer join its > right side's, and a full outer join both sides'. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
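The corrected bound can be sketched as follows. This is a simplified model of the max-rows estimate (the real fix adjusts `maxRows` on the logical join node in Scala), showing that the bound must never fall below the outer side's row count even when the other side is empty:

```python
# Simplified max-rows estimate for joins: an outer join returns at least
# the row count of its outer side(s), so max(product, outer side) is the
# correct upper bound even when one side is empty.

def join_max_rows(left_rows, right_rows, join_type):
    product = left_rows * right_rows
    if join_type == "left_outer":
        return max(product, left_rows)
    if join_type == "right_outer":
        return max(product, right_rows)
    if join_type == "full_outer":
        return max(product, left_rows + right_rows)
    return product  # inner join: empty side really does mean zero rows

print(join_max_rows(5, 0, "left_outer"))   # 5, not 0: left rows survive
print(join_max_rows(0, 7, "right_outer"))  # 7
print(join_max_rows(0, 7, "full_outer"))   # 7
print(join_max_rows(5, 0, "inner"))        # 0 is correct for inner join
```

The buggy estimate used only the product, which collapses to 0 whenever one side is empty, contradicting outer-join semantics.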
[jira] [Commented] (SPARK-37064) Fix outer join return the wrong max rows if other side is empty
[ https://issues.apache.org/jira/browse/SPARK-37064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430846#comment-17430846 ] Apache Spark commented on SPARK-37064: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/34336 > Fix outer join return the wrong max rows if other side is empty > --- > > Key: SPARK-37064 > URL: https://issues.apache.org/jira/browse/SPARK-37064 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0 >Reporter: XiDuo You >Priority: Major > > An outer join should return at least the number of rows of its outer side, > i.e. a left outer join at least its left side's rows, a right outer join its > right side's, and a full outer join both sides'. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36763) Pull out ordering expressions
[ https://issues.apache.org/jira/browse/SPARK-36763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430847#comment-17430847 ] Apache Spark commented on SPARK-36763: -- User 'RabbidHY' has created a pull request for this issue: https://github.com/apache/spark/pull/34334 > Pull out ordering expressions > - > > Key: SPARK-36763 > URL: https://issues.apache.org/jira/browse/SPARK-36763 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > > Similar to > [PullOutGroupingExpressions|https://github.com/apache/spark/blob/7fd3f8f9ec55b364525407213ba1c631705686c5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PullOutGroupingExpressions.scala#L48]. > We can pull out ordering expressions to improve order performance. For > example: > {code:scala} > sql("create table t1(a int, b int) using parquet") > sql("insert into t1 values (1, 2)") > sql("insert into t1 values (3, 4)") > sql("select * from t1 order by a - b").explain > {code} > {noformat} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- Sort [(a#12 - b#13) ASC NULLS FIRST], true, 0 >+- Exchange rangepartitioning((a#12 - b#13) ASC NULLS FIRST, 5), > ENSURE_REQUIREMENTS, [id=#39] > +- FileScan parquet default.t1[a#12,b#13] > {noformat} > The {{Subtract}} will be evaluated 4 times. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36348) unexpected Index loaded: pd.Index([10, 20, None], name="x")
[ https://issues.apache.org/jira/browse/SPARK-36348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36348: Assignee: (was: Apache Spark) > unexpected Index loaded: pd.Index([10, 20, None], name="x") > --- > > Key: SPARK-36348 > URL: https://issues.apache.org/jira/browse/SPARK-36348 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Yikun Jiang >Priority: Major > > {code:python} > pidx = pd.Index([10, 20, 15, 30, 45, None], name="x") > psidx = ps.Index(pidx) > self.assert_eq(psidx.astype(str), pidx.astype(str)) > {code} > [left pandas on spark]: Index(['10.0', '20.0', '15.0', '30.0', '45.0', > 'nan'], dtype='object', name='x') > [right pandas]: Index(['10', '20', '15', '30', '45', 'None'], dtype='object', > name='x') > The index is loaded as float64, so a subsequent step like astype would differ > from pandas > [1] > https://github.com/apache/spark/blob/bcc595c112a23d8e3024ace50f0dbc7eab7144b2/python/pyspark/pandas/tests/indexes/test_base.py#L2249 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
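The root cause can be reproduced without pandas: once `None` forces the integers into a float representation, stringifying each element yields `'10.0'` instead of `'10'`. A minimal sketch of the divergence the Jira reports:

```python
# Model of the SPARK-36348 mismatch: loading [10, 20, ..., None] as float64
# (None -> NaN) changes what astype(str) produces versus pandas keeping
# the original objects in an object-dtype Index.

raw = [10, 20, 15, 30, 45, None]

# What pandas-on-Spark effectively does: load as float64, None becomes NaN.
as_float = [float("nan") if v is None else float(v) for v in raw]
spark_like = [str(v) for v in as_float]

# What pandas does: keep the original objects, stringify them directly.
pandas_like = [str(v) for v in raw]

print(spark_like)   # ['10.0', '20.0', '15.0', '30.0', '45.0', 'nan']
print(pandas_like)  # ['10', '20', '15', '30', '45', 'None']
```

The two string lists match the "[left pandas on spark]" and "[right pandas]" outputs quoted in the issue.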
[jira] [Assigned] (SPARK-37064) Fix outer join return the wrong max rows if other side is empty
[ https://issues.apache.org/jira/browse/SPARK-37064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37064: Assignee: (was: Apache Spark) > Fix outer join return the wrong max rows if other side is empty > --- > > Key: SPARK-37064 > URL: https://issues.apache.org/jira/browse/SPARK-37064 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0 >Reporter: XiDuo You >Priority: Major > > An outer join should return at least the number of rows of its outer side, > i.e. a left outer join at least its left side's rows, a right outer join its > right side's, and a full outer join both sides'. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36348) unexpected Index loaded: pd.Index([10, 20, None], name="x")
[ https://issues.apache.org/jira/browse/SPARK-36348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36348: Assignee: Apache Spark > unexpected Index loaded: pd.Index([10, 20, None], name="x") > --- > > Key: SPARK-36348 > URL: https://issues.apache.org/jira/browse/SPARK-36348 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Yikun Jiang >Assignee: Apache Spark >Priority: Major > > {code:python} > pidx = pd.Index([10, 20, 15, 30, 45, None], name="x") > psidx = ps.Index(pidx) > self.assert_eq(psidx.astype(str), pidx.astype(str)) > {code} > [left pandas on spark]: Index(['10.0', '20.0', '15.0', '30.0', '45.0', > 'nan'], dtype='object', name='x') > [right pandas]: Index(['10', '20', '15', '30', '45', 'None'], dtype='object', > name='x') > The index is loaded as float64, so a subsequent step like astype would differ > from pandas > [1] > https://github.com/apache/spark/blob/bcc595c112a23d8e3024ace50f0dbc7eab7144b2/python/pyspark/pandas/tests/indexes/test_base.py#L2249 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36348) unexpected Index loaded: pd.Index([10, 20, None], name="x")
[ https://issues.apache.org/jira/browse/SPARK-36348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430845#comment-17430845 ] Apache Spark commented on SPARK-36348: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/34335 > unexpected Index loaded: pd.Index([10, 20, None], name="x") > --- > > Key: SPARK-36348 > URL: https://issues.apache.org/jira/browse/SPARK-36348 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Yikun Jiang >Priority: Major > > {code:python} > pidx = pd.Index([10, 20, 15, 30, 45, None], name="x") > psidx = ps.Index(pidx) > self.assert_eq(psidx.astype(str), pidx.astype(str)) > {code} > [left pandas on spark]: Index(['10.0', '20.0', '15.0', '30.0', '45.0', > 'nan'], dtype='object', name='x') > [right pandas]: Index(['10', '20', '15', '30', '45', 'None'], dtype='object', > name='x') > The index is loaded as float64, so a subsequent step like astype would differ > from pandas > [1] > https://github.com/apache/spark/blob/bcc595c112a23d8e3024ace50f0dbc7eab7144b2/python/pyspark/pandas/tests/indexes/test_base.py#L2249 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37064) Fix outer join return the wrong max rows if other side is empty
XiDuo You created SPARK-37064: - Summary: Fix outer join return the wrong max rows if other side is empty Key: SPARK-37064 URL: https://issues.apache.org/jira/browse/SPARK-37064 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0, 3.3.0 Reporter: XiDuo You An outer join should return at least the number of rows of its outer side, i.e. a left outer join at least its left side's rows, a right outer join its right side's, and a full outer join both sides'. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37063) SQL Adaptive Query Execution QA: Phase 2
[ https://issues.apache.org/jira/browse/SPARK-37063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430838#comment-17430838 ] XiDuo You commented on SPARK-37063: --- thank you [~dongjoon] for creating this umbrella ! > SQL Adaptive Query Execution QA: Phase 2 > > > Key: SPARK-37063 > URL: https://issues.apache.org/jira/browse/SPARK-37063 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37002) Introduce the 'compute.eager_check' option
[ https://issues.apache.org/jira/browse/SPARK-37002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37002. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34281 [https://github.com/apache/spark/pull/34281] > Introduce the 'compute.eager_check' option > -- > > Key: SPARK-37002 > URL: https://issues.apache.org/jira/browse/SPARK-37002 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Assignee: dch nguyen >Priority: Major > Fix For: 3.3.0 > > > https://issues.apache.org/jira/browse/SPARK-36968 > [https://github.com/apache/spark/pull/34235] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37002) Introduce the 'compute.eager_check' option
[ https://issues.apache.org/jira/browse/SPARK-37002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-37002: Assignee: dch nguyen > Introduce the 'compute.eager_check' option > -- > > Key: SPARK-37002 > URL: https://issues.apache.org/jira/browse/SPARK-37002 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Assignee: dch nguyen >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-36968 > [https://github.com/apache/spark/pull/34235] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37059) Ensure the sort order of the output in the PySpark doctests
[ https://issues.apache.org/jira/browse/SPARK-37059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-37059. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved in https://github.com/apache/spark/pull/34330 > Ensure the sort order of the output in the PySpark doctests > --- > > Key: SPARK-37059 > URL: https://issues.apache.org/jira/browse/SPARK-37059 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.3.0 > > > The collect_set builtin function doesn't ensure the sort order of its result > for each row. FPGrowthModel.freqItemsets also doesn't ensure the sort order > of the result rows. > Nevertheless, their PySpark doctests assume a certain kind of sort order, > causing such doctests to fail with Scala 2.13. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
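The doctest fix pattern in SPARK-37059 is simple: sort the non-deterministic collection output before printing, so the doctest passes regardless of Scala version or hash order. A toy stand-in for `collect_set`'s unordered result:

```python
# collect_set returns an unordered collection; a doctest that prints it
# raw depends on iteration order. Sorting first makes the output stable.

def collect_set_like(values):
    """Stand-in for collect_set: deduplicates, order not guaranteed."""
    return set(values)

result = collect_set_like([2, 5, 5, 3, 2])
print(sorted(result))  # deterministic regardless of set iteration order
```

The same idea applies to `FPGrowthModel.freqItemsets`: sort the result rows before asserting on them in the doctest.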
[jira] [Resolved] (SPARK-37041) Backport HIVE-15025: Secure-Socket-Layer (SSL) support for HMS
[ https://issues.apache.org/jira/browse/SPARK-37041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-37041. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34312 [https://github.com/apache/spark/pull/34312] > Backport HIVE-15025: Secure-Socket-Layer (SSL) support for HMS > -- > > Key: SPARK-37041 > URL: https://issues.apache.org/jira/browse/SPARK-37041 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.3.0 > > > Backport https://issues.apache.org/jira/browse/HIVE-15025 to make it easier > to upgrade Thrift. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37041) Backport HIVE-15025: Secure-Socket-Layer (SSL) support for HMS
[ https://issues.apache.org/jira/browse/SPARK-37041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-37041: --- Assignee: Yuming Wang > Backport HIVE-15025: Secure-Socket-Layer (SSL) support for HMS > -- > > Key: SPARK-37041 > URL: https://issues.apache.org/jira/browse/SPARK-37041 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > Backport https://issues.apache.org/jira/browse/HIVE-15025 to make it easier > to upgrade Thrift. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37043) Cancel all running job after AQE plan finished
[ https://issues.apache.org/jira/browse/SPARK-37043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430797#comment-17430797 ] Erik Krogen commented on SPARK-37043: - [~ulysses] any concerns if I make this a sub-task of SPARK-37063 to track it alongside other AQE fixes? > Cancel all running job after AQE plan finished > -- > > Key: SPARK-37043 > URL: https://issues.apache.org/jira/browse/SPARK-37043 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > We see a stage still running after the AQE plan finished. This happens when a > plan containing an empty join has been converted to `LocalTableScanExec` > during `AQEOptimizer`, while the other side of the join (a shuffle map stage) > is still running. > > There is no point in keeping that stage running; it's better to cancel it > after the AQE plan finishes to avoid wasting task resources. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33828) SQL Adaptive Query Execution QA
[ https://issues.apache.org/jira/browse/SPARK-33828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430798#comment-17430798 ] Erik Krogen commented on SPARK-33828: - Thanks [~dongjoon]! > SQL Adaptive Query Execution QA > --- > > Key: SPARK-33828 > URL: https://issues.apache.org/jira/browse/SPARK-33828 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Critical > Labels: releasenotes > Fix For: 3.2.0 > > > Since SPARK-31412 is delivered at 3.0.0, we received and handled many JIRA > issues at 3.0.x/3.1.0/3.2.0. This umbrella JIRA issue aims to enable it by > default and collect all information in order to do QA for this feature in > Apache Spark 3.2.0 timeframe. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37062) Introduce a new data source for providing consistent set of rows per microbatch
[ https://issues.apache.org/jira/browse/SPARK-37062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37062: Assignee: (was: Apache Spark) > Introduce a new data source for providing consistent set of rows per > microbatch > --- > > Key: SPARK-37062 > URL: https://issues.apache.org/jira/browse/SPARK-37062 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Jungtaek Lim >Priority: Major > > The "rate" data source has been known to be used as a benchmark for streaming > queries. > While this helps to push a query to its limit (how many rows it can process > per second), the rate data source doesn't provide a consistent set of rows > per batch, which makes two environments hard to compare. > For example, in many cases, you may want to compare the metrics in the > batches between test environments (like running the same streaming query with > different options). These metrics are strongly affected if the distribution > of input rows across batches changes, especially when a micro-batch has > lagged (for any reason) and the rate data source produces more input rows in > the next batch. > Also, when you test against streaming aggregation, you may want the data > source to produce the same set of input rows per batch (deterministic), so > that you can plan how these input rows will be aggregated and how state rows > will be evicted, and craft the test query based on the plan. 
> The requirements of the new data source are as follows: > * it should produce a specific number of input rows as requested > * it should also include a timestamp (event time) in each row > ** to make the input rows fully deterministic, the timestamp should be > configurable as well (like start timestamp & amount of advance per batch) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
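The requirements listed above can be sketched as a tiny deterministic generator: a fixed number of rows per batch, each carrying an event timestamp derived from a configured start time and per-batch advance, so every run yields identical batches. This is an illustrative model, not the data source SPARK-37062 actually implements:

```python
# Deterministic per-batch row generator: rows_per_batch rows per batch,
# with an event timestamp advancing a fixed amount per batch, so batch N
# is identical across runs and environments.

def batch_rows(batch_id, rows_per_batch, start_ts, advance_ms_per_batch):
    """Return the (value, event_time_ms) rows for a given micro-batch."""
    ts = start_ts + batch_id * advance_ms_per_batch
    base = batch_id * rows_per_batch
    return [(base + i, ts) for i in range(rows_per_batch)]

b0 = batch_rows(0, 3, start_ts=1000, advance_ms_per_batch=500)
b1 = batch_rows(1, 3, start_ts=1000, advance_ms_per_batch=500)
print(b0)  # [(0, 1000), (1, 1000), (2, 1000)]
print(b1)  # [(3, 1500), (4, 1500), (5, 1500)]
```

Because the output depends only on the batch id and the configuration, metrics collected per batch are directly comparable between runs, which is the point of the proposal.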
[jira] [Assigned] (SPARK-37062) Introduce a new data source for providing consistent set of rows per microbatch
[ https://issues.apache.org/jira/browse/SPARK-37062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37062: Assignee: Apache Spark > Introduce a new data source for providing consistent set of rows per > microbatch > --- > > Key: SPARK-37062 > URL: https://issues.apache.org/jira/browse/SPARK-37062 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Jungtaek Lim >Assignee: Apache Spark >Priority: Major > > The "rate" data source has been known to be used as a benchmark for streaming > queries. > While this helps to push a query to its limit (how many rows it can process > per second), the rate data source doesn't provide a consistent set of rows > per batch, which makes two environments hard to compare. > For example, in many cases, you may want to compare the metrics in the > batches between test environments (like running the same streaming query with > different options). These metrics are strongly affected if the distribution > of input rows across batches changes, especially when a micro-batch has > lagged (for any reason) and the rate data source produces more input rows in > the next batch. > Also, when you test against streaming aggregation, you may want the data > source to produce the same set of input rows per batch (deterministic), so > that you can plan how these input rows will be aggregated and how state rows > will be evicted, and craft the test query based on the plan. 
> The requirements of the new data source are as follows: > * it should produce a specific number of input rows as requested > * it should also include a timestamp (event time) in each row > ** to make the input rows fully deterministic, the timestamp should be > configurable as well (like start timestamp & amount of advance per batch) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37062) Introduce a new data source for providing consistent set of rows per microbatch
[ https://issues.apache.org/jira/browse/SPARK-37062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430796#comment-17430796 ] Apache Spark commented on SPARK-37062: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/34333 > Introduce a new data source for providing consistent set of rows per > microbatch > --- > > Key: SPARK-37062 > URL: https://issues.apache.org/jira/browse/SPARK-37062 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Jungtaek Lim >Priority: Major > > The "rate" data source has been known to be used as a benchmark for streaming > query. > While this helps to put the query to the limit (how many rows the query could > process per second), the rate data source doesn't provide consistent rows per > batch into stream, which leads two environments be hard to compare with. > For example, in many cases, you may want to compare the metrics in the > batches between test environments (like running same streaming query with > different options). These metrics are strongly affected if the distribution > of input rows in batches are changing, especially a micro-batch has been > lagged (in any reason) and rate data source produces more input rows to the > next batch. > Also, when you test against streaming aggregation, you may want the data > source produces the same set of input rows per batch (deterministic), so that > you can plan how these input rows will be aggregated and how state rows will > be evicted, and craft the test query based on the plan. 
> The requirements of the new data source would be as follows: > * it should produce a specific number of input rows as requested > * it should also include a timestamp (event time) in each row > ** to make the input rows fully deterministic, the timestamp should be configurable > as well (e.g. a start timestamp and an amount to advance per batch) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
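The requirements above can be sketched in plain Python (hypothetical names; the actual data source would be implemented inside Spark): with a fixed rows-per-batch count and a configured start timestamp plus per-batch advance, every micro-batch is a pure function of its batch id.

```python
from datetime import datetime, timedelta

def batch_rows(batch_id, rows_per_batch, start_ts, advance_per_batch):
    """Return the (value, event_time) rows for micro-batch `batch_id`.

    Deterministic by construction: the same batch_id always yields the
    same rows, unlike the "rate" source, whose per-batch row counts
    depend on wall-clock timing.
    """
    event_time = start_ts + batch_id * advance_per_batch
    first_value = batch_id * rows_per_batch
    return [(first_value + i, event_time) for i in range(rows_per_batch)]

start = datetime(2021, 10, 20)
batch1 = batch_rows(1, 5, start, timedelta(seconds=1))
```

With such a source, per-batch metrics from two runs of the same query are directly comparable, and a streaming-aggregation test can predict exactly which state rows each batch creates or evicts.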
[jira] [Commented] (SPARK-33828) SQL Adaptive Query Execution QA
[ https://issues.apache.org/jira/browse/SPARK-33828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430790#comment-17430790 ] Dongjoon Hyun commented on SPARK-33828: --- To all: this JIRA contains all efforts delivered in Apache Spark 3.2.0. SPARK-37063 will track the remaining efforts in the Apache Spark 3.3 timeframe. > SQL Adaptive Query Execution QA > --- > > Key: SPARK-33828 > URL: https://issues.apache.org/jira/browse/SPARK-33828 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Critical > Labels: releasenotes > Fix For: 3.2.0 > > > Since SPARK-31412 was delivered in 3.0.0, we have received and handled many JIRA > issues in 3.0.x/3.1.0/3.2.0. This umbrella JIRA issue aims to enable it by > default and collect all information needed to QA this feature in the Apache Spark 3.2.0 timeframe. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33828) SQL Adaptive Query Execution QA
[ https://issues.apache.org/jira/browse/SPARK-33828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33828. --- Fix Version/s: 3.2.0 Resolution: Done > SQL Adaptive Query Execution QA > --- > > Key: SPARK-33828 > URL: https://issues.apache.org/jira/browse/SPARK-33828 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Critical > Labels: releasenotes > Fix For: 3.2.0 > > > Since SPARK-31412 was delivered in 3.0.0, we have received and handled many JIRA > issues in 3.0.x/3.1.0/3.2.0. This umbrella JIRA issue aims to enable it by > default and collect all information needed to QA this feature in the Apache Spark 3.2.0 timeframe. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36443) Demote BroadcastJoin causes performance regression and increases OOM risks
[ https://issues.apache.org/jira/browse/SPARK-36443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-36443: -- Parent: SPARK-37063 Issue Type: Sub-task (was: Bug) > Demote BroadcastJoin causes performance regression and increases OOM risks > -- > > Key: SPARK-36443 > URL: https://issues.apache.org/jira/browse/SPARK-36443 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.2 >Reporter: Kent Yao >Priority: Major > Attachments: image-2021-08-06-11-24-34-122.png, > image-2021-08-06-17-57-15-765.png, screenshot-1.png > > > > h2. A test case > Use bin/spark-sql in local mode with all other default settings on 3.1.2 > to run the case below > {code:sql} > -- Some comments here > set spark.sql.shuffle.partitions=20; > set spark.sql.adaptive.enabled=true; > -- set spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin=0; -- > (default 0.2) enable this to avoid demoting BHJ > set spark.sql.autoBroadcastJoinThreshold=200; > SELECT > l.id % 12345 k, > sum(l.id) sum, > count(l.id) cnt, > avg(l.id) avg, > min(l.id) min, > max(l.id) max > from (select id % 3 id from range(0, 1e8, 1, 100)) l > left join (SELECT max(id) as id, id % 2 gid FROM range(0, 1000, 2, 100) > group by gid) r ON l.id = r.id > GROUP BY 1; > {code} > > 1. 
demote bhj w/ nonEmptyPartitionRatioForBroadcastJoin comment out > > | | > ||[Job Id > ▾|http://localhost:4040/jobs/?=Job+Id=false=100#completed]||[Description|http://localhost:4040/jobs/?=Description=100#completed]||[Submitted|http://localhost:4040/jobs/?=Submitted=100#completed]||[Duration|http://localhost:4040/jobs/?=Duration=100#completed]||Stages: > Succeeded/Total||Tasks (for all stages): Succeeded/Total|| > |4|SELECT l.id % 12345 k, sum(l.id) sum, count(l.id) cnt, avg(l.id) avg, > min(l.id) min, max(l.id) max from (select id % 3 id from range(0, 1e8, 1, > 100)) l left join (SELECT max(id) as id, id % 2 gid FROM range(0, 1000, 2, > 100) group by gid) r ON l.id = r.id GROUP BY 1[main at > NativeMethodAccessorImpl.java:0|http://localhost:4040/jobs/job/?id=4]|2021/08/06 > 17:31:37|71 ms|1/1 (4 skipped)|3/3 (205 skipped) > | > |3|SELECT l.id % 12345 k, sum(l.id) sum, count(l.id) cnt, avg(l.id) avg, > min(l.id) min, max(l.id) max from (select id % 3 id from range(0, 1e8, 1, > 100)) l left join (SELECT max(id) as id, id % 2 gid FROM range(0, 1000, 2, > 100) group by gid) r ON l.id = r.id GROUP BY 1[main at > NativeMethodAccessorImpl.java:0|http://localhost:4040/jobs/job/?id=3]|2021/08/06 > 17:31:18|19 s|1/1 (3 skipped)|4/4 (201 skipped) > | > |2|SELECT l.id % 12345 k, sum(l.id) sum, count(l.id) cnt, avg(l.id) avg, > min(l.id) min, max(l.id) max from (select id % 3 id from range(0, 1e8, 1, > 100)) l left join (SELECT max(id) as id, id % 2 gid FROM range(0, 1000, 2, > 100) group by gid) r ON l.id = r.id GROUP BY 1[main at > NativeMethodAccessorImpl.java:0|http://localhost:4040/jobs/job/?id=2]|2021/08/06 > 17:31:18|87 ms|1/1 (1 skipped)|1/1 (100 skipped) > | > |1|SELECT l.id % 12345 k, sum(l.id) sum, count(l.id) cnt, avg(l.id) avg, > min(l.id) min, max(l.id) max from (select id % 3 id from range(0, 1e8, 1, > 100)) l left join (SELECT max(id) as id, id % 2 gid FROM range(0, 1000, 2, > 100) group by gid) r ON l.id = r.id GROUP BY 1[main at > 
NativeMethodAccessorImpl.java:0|http://localhost:4040/jobs/job/?id=1]|2021/08/06 > 17:31:16|2 s|1/1|100/100 > | > |0|SELECT l.id % 12345 k, sum(l.id) sum, count(l.id) cnt, avg(l.id) avg, > min(l.id) min, max(l.id) max from (select id % 3 id from range(0, 1e8, 1, > 100)) l left join (SELECT max(id) as id, id % 2 gid FROM range(0, 1000, 2, > 100) group by gid) r ON l.id = r.id GROUP BY 1[main at > NativeMethodAccessorImpl.java:0|http://localhost:4040/jobs/job/?id=0]|2021/08/06 > 17:31:15|2 s|1/1|100/100 | > 2. set nonEmptyPartitionRatioForBroadcastJoin to 0 to tell spark not to > demote bhj > > ||[Job Id (Job Group) > ▾|http://localhost:4040/jobs/?=Job+Id+%28Job+Group%29=false=100#completed]||[Description|http://localhost:4040/jobs/?=Description=100#completed]||[Submitted|http://localhost:4040/jobs/?=Submitted=100#completed]||[Duration|http://localhost:4040/jobs/?=Duration=100#completed]||Stages: > Succeeded/Total||Tasks (for all stages): Succeeded/Total|| > |5|SELECT l.id % 12345 k, sum(l.id) sum, count(l.id) cnt, avg(l.id) avg, > min(l.id) min, max(l.id) max from (select id % 3 id from range(0, 1e8, 1, > 100)) l left join (SELECT max(id) as id, id % 2 gid FROM range(0, 1000, 2, > 100) group by gid) r ON l.id = r.id GROUP BY 1[main at > NativeMethodAccessorImpl.java:0|http://localhost:4040/jobs/job/?id=5]|2021/08/06 > 18:25:15|29 ms|1/1 (2 skipped)|3/3 (200 skipped) > | > |4|SELECT l.id % 12345
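For context, the rule that `spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin` controls can be paraphrased as a one-liner (a hedged sketch of my reading of the AQE behavior, not Spark's actual code): if too few shuffle map output partitions are non-empty, AQE demotes the broadcast hash join, which is exactly what setting the ratio to 0 in the second run disables.

```python
def should_demote_bhj(num_partitions, num_non_empty_partitions, ratio=0.2):
    # Demote the broadcast hash join when the fraction of non-empty shuffle
    # map output partitions falls below the configured ratio (default 0.2).
    # Setting ratio to 0 makes this always False, i.e. never demote.
    return num_non_empty_partitions / num_partitions < ratio

# Mostly-empty map output: demoted under the default ratio (0.1 < 0.2).
demoted = should_demote_bhj(100, 10)
# Ratio pinned to 0, as in the second run above: never demoted.
kept = should_demote_bhj(100, 10, ratio=0.0)
```

This explains the regression reported here: the demotion heuristic fires on a skewed-but-small build side, replacing a cheap broadcast join with a slower shuffle-based plan.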
[jira] [Updated] (SPARK-35540) Make config maxShuffledHashJoinLocalMapThreshold fallback to advisoryPartitionSizeInBytes
[ https://issues.apache.org/jira/browse/SPARK-35540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35540: -- Parent: SPARK-37063 Issue Type: Sub-task (was: Bug) > Make config maxShuffledHashJoinLocalMapThreshold fallback to > advisoryPartitionSizeInBytes > - > > Key: SPARK-35540 > URL: https://issues.apache.org/jira/browse/SPARK-35540 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: XiDuo You >Priority: Major > > Make config `spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold` > fallback to `spark.sql.adaptive.advisoryPartitionSizeInBytes` so that we can > convert SMJ to SHJ in AQE by default. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
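The proposed change amounts to a config lookup whose default is drawn from another config; a minimal sketch (hypothetical helper, with a plain dict standing in for Spark's SQLConf):

```python
def shj_threshold(conf):
    # If the SHJ-specific threshold is unset, fall back to the advisory
    # partition size, so AQE can plan SMJ -> SHJ conversion by default.
    value = conf.get("spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold")
    if value is not None:
        return value
    return conf["spark.sql.adaptive.advisoryPartitionSizeInBytes"]

explicit = shj_threshold({
    "spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold": 0,
    "spark.sql.adaptive.advisoryPartitionSizeInBytes": 64 * 1024 * 1024,
})
fallback = shj_threshold({
    "spark.sql.adaptive.advisoryPartitionSizeInBytes": 64 * 1024 * 1024,
})
```

An explicitly configured value (even 0, which disables the conversion) wins; only an unset value falls back.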
[jira] [Updated] (SPARK-36443) Demote BroadcastJoin causes performance regression and increases OOM risks
[ https://issues.apache.org/jira/browse/SPARK-36443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-36443: -- Parent: (was: SPARK-33828) Issue Type: Bug (was: Sub-task) > Demote BroadcastJoin causes performance regression and increases OOM risks > -- > > Key: SPARK-36443 > URL: https://issues.apache.org/jira/browse/SPARK-36443 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Kent Yao >Priority: Major > Attachments: image-2021-08-06-11-24-34-122.png, > image-2021-08-06-17-57-15-765.png, screenshot-1.png > > > > h2. A test case > Use bin/spark-sql in local mode with all other default settings on 3.1.2 > to run the case below > {code:sql} > -- Some comments here > set spark.sql.shuffle.partitions=20; > set spark.sql.adaptive.enabled=true; > -- set spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin=0; -- > (default 0.2) enable this to avoid demoting BHJ > set spark.sql.autoBroadcastJoinThreshold=200; > SELECT > l.id % 12345 k, > sum(l.id) sum, > count(l.id) cnt, > avg(l.id) avg, > min(l.id) min, > max(l.id) max > from (select id % 3 id from range(0, 1e8, 1, 100)) l > left join (SELECT max(id) as id, id % 2 gid FROM range(0, 1000, 2, 100) > group by gid) r ON l.id = r.id > GROUP BY 1; > {code} > > 1. 
demote bhj w/ nonEmptyPartitionRatioForBroadcastJoin comment out > > | | > ||[Job Id > ▾|http://localhost:4040/jobs/?=Job+Id=false=100#completed]||[Description|http://localhost:4040/jobs/?=Description=100#completed]||[Submitted|http://localhost:4040/jobs/?=Submitted=100#completed]||[Duration|http://localhost:4040/jobs/?=Duration=100#completed]||Stages: > Succeeded/Total||Tasks (for all stages): Succeeded/Total|| > |4|SELECT l.id % 12345 k, sum(l.id) sum, count(l.id) cnt, avg(l.id) avg, > min(l.id) min, max(l.id) max from (select id % 3 id from range(0, 1e8, 1, > 100)) l left join (SELECT max(id) as id, id % 2 gid FROM range(0, 1000, 2, > 100) group by gid) r ON l.id = r.id GROUP BY 1[main at > NativeMethodAccessorImpl.java:0|http://localhost:4040/jobs/job/?id=4]|2021/08/06 > 17:31:37|71 ms|1/1 (4 skipped)|3/3 (205 skipped) > | > |3|SELECT l.id % 12345 k, sum(l.id) sum, count(l.id) cnt, avg(l.id) avg, > min(l.id) min, max(l.id) max from (select id % 3 id from range(0, 1e8, 1, > 100)) l left join (SELECT max(id) as id, id % 2 gid FROM range(0, 1000, 2, > 100) group by gid) r ON l.id = r.id GROUP BY 1[main at > NativeMethodAccessorImpl.java:0|http://localhost:4040/jobs/job/?id=3]|2021/08/06 > 17:31:18|19 s|1/1 (3 skipped)|4/4 (201 skipped) > | > |2|SELECT l.id % 12345 k, sum(l.id) sum, count(l.id) cnt, avg(l.id) avg, > min(l.id) min, max(l.id) max from (select id % 3 id from range(0, 1e8, 1, > 100)) l left join (SELECT max(id) as id, id % 2 gid FROM range(0, 1000, 2, > 100) group by gid) r ON l.id = r.id GROUP BY 1[main at > NativeMethodAccessorImpl.java:0|http://localhost:4040/jobs/job/?id=2]|2021/08/06 > 17:31:18|87 ms|1/1 (1 skipped)|1/1 (100 skipped) > | > |1|SELECT l.id % 12345 k, sum(l.id) sum, count(l.id) cnt, avg(l.id) avg, > min(l.id) min, max(l.id) max from (select id % 3 id from range(0, 1e8, 1, > 100)) l left join (SELECT max(id) as id, id % 2 gid FROM range(0, 1000, 2, > 100) group by gid) r ON l.id = r.id GROUP BY 1[main at > 
NativeMethodAccessorImpl.java:0|http://localhost:4040/jobs/job/?id=1]|2021/08/06 > 17:31:16|2 s|1/1|100/100 > | > |0|SELECT l.id % 12345 k, sum(l.id) sum, count(l.id) cnt, avg(l.id) avg, > min(l.id) min, max(l.id) max from (select id % 3 id from range(0, 1e8, 1, > 100)) l left join (SELECT max(id) as id, id % 2 gid FROM range(0, 1000, 2, > 100) group by gid) r ON l.id = r.id GROUP BY 1[main at > NativeMethodAccessorImpl.java:0|http://localhost:4040/jobs/job/?id=0]|2021/08/06 > 17:31:15|2 s|1/1|100/100 | > 2. set nonEmptyPartitionRatioForBroadcastJoin to 0 to tell spark not to > demote bhj > > ||[Job Id (Job Group) > ▾|http://localhost:4040/jobs/?=Job+Id+%28Job+Group%29=false=100#completed]||[Description|http://localhost:4040/jobs/?=Description=100#completed]||[Submitted|http://localhost:4040/jobs/?=Submitted=100#completed]||[Duration|http://localhost:4040/jobs/?=Duration=100#completed]||Stages: > Succeeded/Total||Tasks (for all stages): Succeeded/Total|| > |5|SELECT l.id % 12345 k, sum(l.id) sum, count(l.id) cnt, avg(l.id) avg, > min(l.id) min, max(l.id) max from (select id % 3 id from range(0, 1e8, 1, > 100)) l left join (SELECT max(id) as id, id % 2 gid FROM range(0, 1000, 2, > 100) group by gid) r ON l.id = r.id GROUP BY 1[main at > NativeMethodAccessorImpl.java:0|http://localhost:4040/jobs/job/?id=5]|2021/08/06 > 18:25:15|29 ms|1/1 (2 skipped)|3/3 (200 skipped) > | > |4|SELECT l.id %
[jira] [Updated] (SPARK-32304) Flaky Test: AdaptiveQueryExecSuite.multiple joins
[ https://issues.apache.org/jira/browse/SPARK-32304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32304: -- Parent: SPARK-37063 Issue Type: Sub-task (was: Bug) > Flaky Test: AdaptiveQueryExecSuite.multiple joins > - > > Key: SPARK-32304 > URL: https://issues.apache.org/jira/browse/SPARK-32304 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > AdaptiveQueryExecSuite: > - Change merge join to broadcast join (313 milliseconds) > - Reuse the parallelism of CoalescedShuffleReaderExec in > LocalShuffleReaderExec (265 milliseconds) > - Reuse the default parallelism in LocalShuffleReaderExec (230 milliseconds) > - Empty stage coalesced to 0-partition RDD (514 milliseconds) > - Scalar subquery (406 milliseconds) > - Scalar subquery in later stages (500 milliseconds) > - multiple joins *** FAILED *** (739 milliseconds) > ArrayBuffer(BroadcastHashJoin [b#251429], [a#251438], Inner, BuildLeft > :- BroadcastQueryStage 5 > : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[3, int, > false] as bigint))), [id=#504817] > : +- CustomShuffleReader local > :+- ShuffleQueryStage 4 > : +- Exchange hashpartitioning(b#251429, 5), true, [id=#504777] > : +- *(7) SortMergeJoin [key#251418], [a#251428], Inner > : :- *(5) Sort [key#251418 ASC NULLS FIRST], false, 0 > : : +- CustomShuffleReader coalesced > : : +- ShuffleQueryStage 0 > : :+- Exchange hashpartitioning(key#251418, 5), > true, [id=#504656] > : : +- *(1) Filter (isnotnull(value#251419) AND > (cast(value#251419 as int) = 1)) > : : +- *(1) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#251418, > staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, > fromString, knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false) > AS 
value#251419] > : : +- Scan[obj#251417] > : +- *(6) Sort [a#251428 ASC NULLS FIRST], false, 0 > :+- CustomShuffleReader coalesced > : +- ShuffleQueryStage 1 > : +- Exchange hashpartitioning(a#251428, 5), true, > [id=#504663] > : +- *(2) Filter (b#251429 = 1) > :+- *(2) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#251428, > knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#251429] > : +- Scan[obj#251427] > +- BroadcastHashJoin [n#251498], [a#251438], Inner, BuildRight > :- CustomShuffleReader local > : +- ShuffleQueryStage 2 > : +- Exchange hashpartitioning(n#251498, 5), true, [id=#504680] > :+- *(3) Filter (n#251498 = 1) > : +- *(3) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).n AS n#251498, > staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, > fromString, knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).l, true, false) > AS l#251499] > : +- Scan[obj#251497] > +- BroadcastQueryStage 6 > +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, > int, false] as bigint))), [id=#504830] >+- CustomShuffleReader local > +- ShuffleQueryStage 3 > +- Exchange hashpartitioning(a#251438, 5), true, [id=#504694] > +- *(4) Filter (a#251438 = 1) >+- *(4) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData3, true])).a AS a#251438, > unwrapoption(IntegerType, knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData3, true])).b) AS b#251439] > +- Scan[obj#251437] > , BroadcastHashJoin [n#251498], [a#251438], Inner, BuildRight > :- CustomShuffleReader local > : +- ShuffleQueryStage 2 > : +- Exchange hashpartitioning(n#251498, 5), true, [id=#504680] > :+- *(3) Filter (n#251498 = 1) > : +- *(3) SerializeFromObject > 
[knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).n AS n#251498, >
[jira] [Updated] (SPARK-35540) Make config maxShuffledHashJoinLocalMapThreshold fallback to advisoryPartitionSizeInBytes
[ https://issues.apache.org/jira/browse/SPARK-35540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35540: -- Parent: (was: SPARK-33828) Issue Type: Bug (was: Sub-task) > Make config maxShuffledHashJoinLocalMapThreshold fallback to > advisoryPartitionSizeInBytes > - > > Key: SPARK-35540 > URL: https://issues.apache.org/jira/browse/SPARK-35540 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: XiDuo You >Priority: Major > > Make config `spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold` > fallback to `spark.sql.adaptive.advisoryPartitionSizeInBytes` so that we can > convert SMJ to SHJ in AQE by default. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32304) Flaky Test: AdaptiveQueryExecSuite.multiple joins
[ https://issues.apache.org/jira/browse/SPARK-32304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32304: -- Parent: (was: SPARK-33828) Issue Type: Bug (was: Sub-task) > Flaky Test: AdaptiveQueryExecSuite.multiple joins > - > > Key: SPARK-32304 > URL: https://issues.apache.org/jira/browse/SPARK-32304 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > AdaptiveQueryExecSuite: > - Change merge join to broadcast join (313 milliseconds) > - Reuse the parallelism of CoalescedShuffleReaderExec in > LocalShuffleReaderExec (265 milliseconds) > - Reuse the default parallelism in LocalShuffleReaderExec (230 milliseconds) > - Empty stage coalesced to 0-partition RDD (514 milliseconds) > - Scalar subquery (406 milliseconds) > - Scalar subquery in later stages (500 milliseconds) > - multiple joins *** FAILED *** (739 milliseconds) > ArrayBuffer(BroadcastHashJoin [b#251429], [a#251438], Inner, BuildLeft > :- BroadcastQueryStage 5 > : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[3, int, > false] as bigint))), [id=#504817] > : +- CustomShuffleReader local > :+- ShuffleQueryStage 4 > : +- Exchange hashpartitioning(b#251429, 5), true, [id=#504777] > : +- *(7) SortMergeJoin [key#251418], [a#251428], Inner > : :- *(5) Sort [key#251418 ASC NULLS FIRST], false, 0 > : : +- CustomShuffleReader coalesced > : : +- ShuffleQueryStage 0 > : :+- Exchange hashpartitioning(key#251418, 5), > true, [id=#504656] > : : +- *(1) Filter (isnotnull(value#251419) AND > (cast(value#251419 as int) = 1)) > : : +- *(1) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#251418, > staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, > fromString, knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false) > AS 
value#251419] > : : +- Scan[obj#251417] > : +- *(6) Sort [a#251428 ASC NULLS FIRST], false, 0 > :+- CustomShuffleReader coalesced > : +- ShuffleQueryStage 1 > : +- Exchange hashpartitioning(a#251428, 5), true, > [id=#504663] > : +- *(2) Filter (b#251429 = 1) > :+- *(2) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#251428, > knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#251429] > : +- Scan[obj#251427] > +- BroadcastHashJoin [n#251498], [a#251438], Inner, BuildRight > :- CustomShuffleReader local > : +- ShuffleQueryStage 2 > : +- Exchange hashpartitioning(n#251498, 5), true, [id=#504680] > :+- *(3) Filter (n#251498 = 1) > : +- *(3) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).n AS n#251498, > staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, > fromString, knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).l, true, false) > AS l#251499] > : +- Scan[obj#251497] > +- BroadcastQueryStage 6 > +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, > int, false] as bigint))), [id=#504830] >+- CustomShuffleReader local > +- ShuffleQueryStage 3 > +- Exchange hashpartitioning(a#251438, 5), true, [id=#504694] > +- *(4) Filter (a#251438 = 1) >+- *(4) SerializeFromObject > [knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData3, true])).a AS a#251438, > unwrapoption(IntegerType, knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$TestData3, true])).b) AS b#251439] > +- Scan[obj#251437] > , BroadcastHashJoin [n#251498], [a#251438], Inner, BuildRight > :- CustomShuffleReader local > : +- ShuffleQueryStage 2 > : +- Exchange hashpartitioning(n#251498, 5), true, [id=#504680] > :+- *(3) Filter (n#251498 = 1) > : +- *(3) SerializeFromObject > 
[knownnotnull(assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$LowerCaseData, true])).n AS n#251498, >
[jira] [Updated] (SPARK-35414) Completely fix the broadcast timeout issue in AQE
[ https://issues.apache.org/jira/browse/SPARK-35414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35414: -- Parent: SPARK-37063 Issue Type: Sub-task (was: Bug) > Completely fix the broadcast timeout issue in AQE > - > > Key: SPARK-35414 > URL: https://issues.apache.org/jira/browse/SPARK-35414 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Yu Zhong >Assignee: Yu Zhong >Priority: Major > > SPARK-33933 reported an issue that in AQE, when resources are limited, a > broadcast timeout can happen. > [#31269|https://github.com/apache/spark/pull/31269] gives a partial fix by > reordering newStages by class type to make sure BroadcastQueryStage precedes > others when calling materialize(). However, it only guarantees the scheduling > order under normal circumstances; the guarantee is not strict, since the > broadcast job and the shuffle map job are submitted from different threads. > So we need a complete fix to avoid the edge case that triggers the broadcast > timeout. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35414) Completely fix the broadcast timeout issue in AQE
[ https://issues.apache.org/jira/browse/SPARK-35414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35414: -- Parent: (was: SPARK-33828) Issue Type: Bug (was: Sub-task) > Completely fix the broadcast timeout issue in AQE > - > > Key: SPARK-35414 > URL: https://issues.apache.org/jira/browse/SPARK-35414 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Yu Zhong >Assignee: Yu Zhong >Priority: Major > > SPARK-33933 reported an issue that in AQE, when resources are limited, a > broadcast timeout can happen. > [#31269|https://github.com/apache/spark/pull/31269] gives a partial fix by > reordering newStages by class type to make sure BroadcastQueryStage precedes > others when calling materialize(). However, it only guarantees the scheduling > order under normal circumstances; the guarantee is not strict, since the > broadcast job and the shuffle map job are submitted from different threads. > So we need a complete fix to avoid the edge case that triggers the broadcast > timeout. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
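The partial fix described here (reordering newStages so broadcast stages are materialized first) boils down to a stable sort by stage kind; a sketch with hypothetical stage records, not Spark's actual classes:

```python
def order_new_stages(stages):
    # Materialize broadcast query stages before shuffle map stages so the
    # broadcast jobs are submitted as early as possible. Python's sort is
    # stable, so relative order within each kind is preserved. As the issue
    # notes, this only biases scheduling; it is not a strict guarantee,
    # because the jobs are submitted from different threads.
    return sorted(stages, key=lambda s: 0 if s["kind"] == "broadcast" else 1)

stages = [
    {"id": 0, "kind": "shuffle"},
    {"id": 1, "kind": "broadcast"},
    {"id": 2, "kind": "shuffle"},
    {"id": 3, "kind": "broadcast"},
]
ordered = order_new_stages(stages)
```

The remaining race (broadcast submitted after shuffle map jobs have saturated the cluster) is what the "complete fix" requested here needs to close.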
[jira] [Updated] (SPARK-34827) Support fetching shuffle blocks in batch with i/o encryption
[ https://issues.apache.org/jira/browse/SPARK-34827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-34827: -- Parent: (was: SPARK-33828) Issue Type: Bug (was: Sub-task) > Support fetching shuffle blocks in batch with i/o encryption > > > Key: SPARK-34827 > URL: https://issues.apache.org/jira/browse/SPARK-34827 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0 >Reporter: Dongjoon Hyun >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34827) Support fetching shuffle blocks in batch with i/o encryption
[ https://issues.apache.org/jira/browse/SPARK-34827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-34827: -- Parent: SPARK-37063 Issue Type: Sub-task (was: Bug) > Support fetching shuffle blocks in batch with i/o encryption > > > Key: SPARK-34827 > URL: https://issues.apache.org/jira/browse/SPARK-34827 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0, 3.3.0 >Reporter: Dongjoon Hyun >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37063) SQL Adaptive Query Execution QA: Phase 2
Dongjoon Hyun created SPARK-37063: - Summary: SQL Adaptive Query Execution QA: Phase 2 Key: SPARK-37063 URL: https://issues.apache.org/jira/browse/SPARK-37063 Project: Spark Issue Type: Umbrella Components: SQL Affects Versions: 3.3.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36913) Implement createIndex and IndexExists in JDBC (MySQL dialect)
[ https://issues.apache.org/jira/browse/SPARK-36913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430780#comment-17430780 ] Reynold Xin commented on SPARK-36913: - My concern is not about JDBC (I should've commented on the parent ticket). My concern is that there are *a lot* of RDBMS features and we can't possibly support all of them. It seems like we'd be much better off just having a generic fallback API to execute a command that's passed through by Spark to the underlying data source, and then the underlying data source can decide what to do. Otherwise we will have to add create index, define foreign key, define sequence objects, and 50 other DDL commands in Spark. > Implement createIndex and IndexExists in JDBC (MySQL dialect) > - > > Key: SPARK-36913 > URL: https://issues.apache.org/jira/browse/SPARK-36913 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
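The generic fallback suggested here can be sketched as a single pass-through method (hypothetical names; an illustration of the idea, not an actual Spark interface): Spark forwards an opaque command string, and the source decides whether and how to run it.

```python
class SupportsExternalCommand:
    """Hypothetical capability interface for pass-through commands."""
    def execute_command(self, command: str, options: dict) -> list:
        raise NotImplementedError

class RecordingJdbcSource(SupportsExternalCommand):
    # A toy source that records commands instead of running them over JDBC.
    # A real implementation would forward `command` to the database as-is,
    # covering CREATE INDEX, foreign keys, sequences, and other DDL without
    # Spark modeling each statement natively.
    def __init__(self):
        self.executed = []

    def execute_command(self, command, options):
        self.executed.append(command)
        return []  # output rows, if the command produces any

src = RecordingJdbcSource()
src.execute_command("CREATE INDEX idx ON t (c1)", {})
```

The trade-off is the one raised in this thread: a pass-through API scales to arbitrary RDBMS features, but Spark loses the ability to reason about (or standardize) the commands it forwards.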
[jira] [Commented] (SPARK-36913) Implement createIndex and IndexExists in JDBC (MySQL dialect)
[ https://issues.apache.org/jira/browse/SPARK-36913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430778#comment-17430778 ] Huaxin Gao commented on SPARK-36913: [~rxin] Hi Reynold, thank you for taking a look at this and sharing your concerns. Are you concerned about the SupportsIndex interface or the JDBC implementation? The interface I added is very generic and it's up to the data source to implement it. The JDBC implementation just provides a simple way to do end-to-end testing, so I don't have to implement the new interface in InMemoryTable for the test. > Implement createIndex and IndexExists in JDBC (MySQL dialect) > - > > Key: SPARK-36913 > URL: https://issues.apache.org/jira/browse/SPARK-36913 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29497) Cannot assign instance of java.lang.invoke.SerializedLambda to field
[ https://issues.apache.org/jira/browse/SPARK-29497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29497: -- Affects Version/s: 3.2.0 > Cannot assign instance of java.lang.invoke.SerializedLambda to field > > > Key: SPARK-29497 > URL: https://issues.apache.org/jira/browse/SPARK-29497 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.3, 3.0.1, 3.2.0 > Environment: Spark 2.4.3 Scala 2.12 >Reporter: Rob Russo >Priority: Major > > Note this is for Scala 2.12: > There seems to be an issue in Spark with serializing a UDF that is created > from a function assigned to a class member that references another function > assigned to a class member. This is similar to > https://issues.apache.org/jira/browse/SPARK-25047 but it looks like the > resolution has an issue with this case. After trimming it down to the base > issue I came up with the following to reproduce: > > > {code:java} > object TestLambdaShell extends Serializable { > val hello: String => String = s => s"hello $s!" 
> val lambdaTest: String => String = hello( _ ) > def functionTest: String => String = hello( _ ) > } > val hello = udf( TestLambdaShell.hello ) > val functionTest = udf( TestLambdaShell.functionTest ) > val lambdaTest = udf( TestLambdaShell.lambdaTest ) > sc.parallelize(Seq("world"),1).toDF("test").select(hello($"test")).show(1) > sc.parallelize(Seq("world"),1).toDF("test").select(functionTest($"test")).show(1) > sc.parallelize(Seq("world"),1).toDF("test").select(lambdaTest($"test")).show(1) > {code} > > All of which works except the last line which results in an exception on the > executors: > > {code:java} > Caused by: java.lang.ClassCastException: cannot assign instance of > java.lang.invoke.SerializedLambda to field > $$$82b5b23cea489b2712a1db46c77e458w$TestLambdaShell$.lambdaTest of type > scala.Function1 in instance of > $$$82b5b23cea489b2712a1db46c77e458w$TestLambdaShell$ > at > java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133) > at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2251) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) > at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1933) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) > at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1933) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529) > at 
java.io.ObjectInputStream.readArray(ObjectInputStream.java:1933) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) > at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1933) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422) > at >
[jira] [Updated] (SPARK-29497) Cannot assign instance of java.lang.invoke.SerializedLambda to field
[ https://issues.apache.org/jira/browse/SPARK-29497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29497: -- Environment: Spark 2.4.3 Scala 2.12 Spark 3.2.0 Scala 2.13.5 (Java 11.0.12) was:Spark 2.4.3 Scala 2.12
[jira] [Commented] (SPARK-37062) Introduce a new data source for providing consistent set of rows per microbatch
[ https://issues.apache.org/jira/browse/SPARK-37062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430761#comment-17430761 ] Jungtaek Lim commented on SPARK-37062: -- I will submit a PR for this soon. > Introduce a new data source for providing consistent set of rows per > microbatch > --- > > Key: SPARK-37062 > URL: https://issues.apache.org/jira/browse/SPARK-37062 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Jungtaek Lim >Priority: Major > > The "rate" data source is commonly used as a benchmark for streaming queries. > While this helps to push a query to its limit (how many rows the query can > process per second), the rate data source doesn't provide a consistent set of > rows per batch, which makes two environments hard to compare. > For example, in many cases you may want to compare the per-batch metrics > between test environments (such as running the same streaming query with > different options). These metrics are strongly affected if the distribution > of input rows across batches changes, especially when a micro-batch has > lagged (for any reason) and the rate data source produces more input rows in > the next batch. > Also, when you test streaming aggregation, you may want the data > source to produce the same set of input rows per batch (deterministically), so that > you can plan how these input rows will be aggregated and how state rows will > be evicted, and craft the test query based on that plan. 
> The requirements for the new data source are as follows: > * it should produce a specific number of input rows per batch, as requested > * it should also include a timestamp (event time) in each row > ** to make the input rows fully deterministic, the timestamp should be configurable as well (e.g. a start timestamp and the amount to advance per batch) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
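The requirements above boil down to making the rows a pure function of the batch id. A minimal sketch of that determinism follows; the parameter names (rowsPerBatch, startTimestamp, advanceMillisPerBatch) are illustrative placeholders, not the eventual data source's options:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the deterministic semantics described above; all names are
// illustrative, not the actual data source API.
public class DeterministicBatches {
    /** Row: {event-time millis, monotonically increasing value}. */
    static List<long[]> batchRows(long batchId, int rowsPerBatch,
                                  long startTimestamp, long advanceMillisPerBatch) {
        // For a given batchId the output is always identical, regardless of
        // wall-clock time, so two environments see the same input distribution.
        long eventTime = startTimestamp + batchId * advanceMillisPerBatch;
        List<long[]> rows = new ArrayList<>();
        for (int i = 0; i < rowsPerBatch; i++) {
            rows.add(new long[] {eventTime, batchId * (long) rowsPerBatch + i});
        }
        return rows;
    }

    public static void main(String[] args) {
        List<long[]> batch2 = batchRows(2, 3, 1_000L, 100L);
        System.out.println(batch2.size());     // 3
        System.out.println(batch2.get(0)[0]);  // 1200 (event time for every row in batch 2)
    }
}
```

Unlike the rate source, a lagged micro-batch here cannot pull extra rows into the next batch: batch N always carries exactly `rowsPerBatch` rows with the same event time.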
[jira] [Created] (SPARK-37062) Introduce a new data source for providing consistent set of rows per microbatch
Jungtaek Lim created SPARK-37062: Summary: Introduce a new data source for providing consistent set of rows per microbatch Key: SPARK-37062 URL: https://issues.apache.org/jira/browse/SPARK-37062 Project: Spark Issue Type: New Feature Components: Structured Streaming Affects Versions: 3.3.0 Reporter: Jungtaek Lim
[jira] [Updated] (SPARK-37061) Custom V2 Metrics uses wrong classname for lookup
[ https://issues.apache.org/jira/browse/SPARK-37061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Russell Spitzer updated SPARK-37061: Description: Currently CustomMetrics uses `getCanonicalName` to get the metric type name https://github.com/apache/spark/blob/38493401d18d42a6cb176bf515536af97ba1338b/sql/core/src/main/scala/org/apache/spark/sql/execution/metric/CustomMetrics.scala#L31-L33 But when using reflection we need to use the original type name. Here is an example when working with an inner class {code:java title="CanonicalName vs Name"} Class.getName = org.apache.iceberg.spark.source.SparkBatchScan$FilesScannedMetric Class.getCanonicalName = org.apache.iceberg.spark.source.SparkBatchScan.FilesScannedMetric {code} The "$" name is required to look up this class while the "." version will fail with CNF. was: Currently CustomMetrics uses `getCanonicalName` to get the metric type name https://github.com/apache/spark/blob/38493401d18d42a6cb176bf515536af97ba1338b/sql/core/src/main/scala/org/apache/spark/sql/execution/metric/CustomMetrics.scala#L31-L33 But when using reflection we need to use the original type name. Here is an example when working with an inner class ``` Class.getName = org.apache.iceberg.spark.source.SparkBatchScan$FilesScannedMetric Class.getCanonicalName = org.apache.iceberg.spark.source.SparkBatchScan.FilesScannedMetric ``` The "$" name is required to look up this class while the "." version will fail with CNF. 
> Custom V2 Metrics uses wrong classname for lookup > - > > Key: SPARK-37061 > URL: https://issues.apache.org/jira/browse/SPARK-37061 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Russell Spitzer >Priority: Minor > > Currently CustomMetrics uses `getCanonicalName` to get the metric type name > https://github.com/apache/spark/blob/38493401d18d42a6cb176bf515536af97ba1338b/sql/core/src/main/scala/org/apache/spark/sql/execution/metric/CustomMetrics.scala#L31-L33 > But when using reflection we need to use the original type name. > Here is an example when working with an inner class > {code:java title="CanonicalName vs Name"} > Class.getName = > org.apache.iceberg.spark.source.SparkBatchScan$FilesScannedMetric > Class.getCanonicalName = > org.apache.iceberg.spark.source.SparkBatchScan.FilesScannedMetric > {code} > The "$" name is required to look up this class while the "." version will > fail with CNF. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
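The naming difference is easy to reproduce outside Spark with any nested class; only the binary (`$`) name returned by `getName()` is accepted by `Class.forName`:

```java
public class BinaryNameDemo {
    public static class Inner {}

    public static void main(String[] args) {
        // getName() returns the binary name the class loader understands,
        // e.g. BinaryNameDemo$Inner.
        System.out.println(Inner.class.getName());
        // getCanonicalName() returns the source-level dotted name,
        // e.g. BinaryNameDemo.Inner.
        System.out.println(Inner.class.getCanonicalName());

        try {
            Class.forName(Inner.class.getName());           // succeeds
            Class.forName(Inner.class.getCanonicalName());  // throws CNF
        } catch (ClassNotFoundException e) {
            System.out.println("CNF for canonical name: " + e.getMessage());
        }
    }
}
```

This is exactly why the reflective metric lookup must use `getName()` when the metric class is an inner class.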
[jira] [Created] (SPARK-37061) Custom V2 Metrics uses wrong classname in for lookup
Russell Spitzer created SPARK-37061: --- Summary: Custom V2 Metrics uses wrong classname in for lookup Key: SPARK-37061 URL: https://issues.apache.org/jira/browse/SPARK-37061 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: Russell Spitzer
[jira] [Updated] (SPARK-37061) Custom V2 Metrics uses wrong classname for lookup
[ https://issues.apache.org/jira/browse/SPARK-37061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Russell Spitzer updated SPARK-37061: Summary: Custom V2 Metrics uses wrong classname for lookup (was: Custom V2 Metrics uses wrong classname in for lookup)
[jira] [Commented] (SPARK-37044) Add Row to __all__ in pyspark.sql.types
[ https://issues.apache.org/jira/browse/SPARK-37044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430675#comment-17430675 ] Apache Spark commented on SPARK-37044: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/34332 > Add Row to __all__ in pyspark.sql.types > --- > > Key: SPARK-37044 > URL: https://issues.apache.org/jira/browse/SPARK-37044 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.1.0, 3.2.0, 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > Currently {{Row}}, defined in {{pyspark.sql.types}}, is exported from > {{pyspark.sql}} but not from {{pyspark.sql.types}} itself. This means that {{from pyspark.sql.types import > *}} won't import {{Row}}. > This can be counter-intuitive, especially since we import {{Row}} from > {{types}} in {{examples}}. > Should we add it to {{__all__}}? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37044) Add Row to __all__ in pyspark.sql.types
[ https://issues.apache.org/jira/browse/SPARK-37044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37044: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-37044) Add Row to __all__ in pyspark.sql.types
[ https://issues.apache.org/jira/browse/SPARK-37044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430673#comment-17430673 ] Apache Spark commented on SPARK-37044: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/34332
[jira] [Assigned] (SPARK-37044) Add Row to __all__ in pyspark.sql.types
[ https://issues.apache.org/jira/browse/SPARK-37044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37044: Assignee: Apache Spark
[jira] [Updated] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours
[ https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-33152: - Description: h2. Q1. What are you trying to do? Articulate your objectives using absolutely no jargon. Proposing a new algorithm to create, store, and use constraints for removing redundant filters and inferring new filters. The current algorithm has subpar performance in complex expression scenarios involving aliases (with certain use cases the compilation time can run into hours), has the potential to cause OOM, may miss removing redundant filters in some scenarios, may miss creating IsNotNull constraints in some scenarios, and does not push compound predicates into Join. # If not fixed, this issue can cause OutOfMemory errors or unacceptable query compilation times. Added a test "plan equivalence with case statements and performance comparison with benefit of more than 10x conservatively" in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. *With this PR the compilation time is 247 ms vs 13958 ms without the change* # It is more effective in filter pruning, as is evident in some of the tests in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite, where the current code is unable to identify the redundant filter in some cases. # It is able to generate a better optimized plan for join queries, as it can push compound predicates. # The current logic can miss many possible cases of removing redundant predicates, as it fails to take into account whether the same attribute or its aliases are repeated multiple times in a complex expression. # There are cases where some of the optimizer rules involving removal of redundant predicates fail to remove them on the basis of constraint data. In some cases the rule works only by virtue of previous rules covering the inaccuracy. 
That the ConstraintPropagation rule and its job of removing redundant filters and adding newly inferred filters depend on the workings of other, unrelated optimizer rules is itself indicative of issues. # It does away with all the EqualNullSafe constraints, as this logic does not need those constraints to be created. # There is at least one test in the existing ConstraintPropagationSuite which is missing an IsNotNull constraint because the code incorrectly generated an EqualNullSafe constraint instead of an EqualTo constraint when using the existing constraints code. With these changes, the test correctly creates an EqualTo constraint, resulting in an inferred IsNotNull constraint. # It does away with the current combinatorial logic of evaluating all the constraints, which can cause compilation to run for hours or cause OOM. The number of constraints stored is exactly the same as the number of filters encountered. h2. Q2. What problem is this proposal NOT designed to solve? It mainly focuses on compile-time performance, but in some cases it can benefit runtime characteristics too, such as inferring an IsNotNull filter or pushing down compound predicates on the join, which the present code may miss or does not do, respectively. h2. Q3. How is it done today, and what are the limits of current practice? The current constraint-propagation code pessimistically tries to generate all the possible combinations of constraints based on the aliases (and even then it may miss many combinations if the expression is a complex expression involving the same attribute repeated multiple times and there are many aliases to that column). There are query plans in our production environment which can result in the intermediate number of constraints going into the hundreds of thousands, causing OOM or taking hours. 
Also, there are cases where it incorrectly generates an EqualNullSafe constraint instead of an EqualTo constraint, thus missing a possible IsNotNull constraint on a column. It also only pushes single-column predicates to the other side of the join. The constraints generated are, in some cases, missing the required ones, and the plan apparently behaves correctly only due to a preceding, unrelated optimizer rule. There is a test which shows that with the bare minimum rules containing RemoveRedundantPredicate, it misses the removal of the redundant predicate. h2. Q4. What is new in your approach and why do you think it will be successful? It solves all the above-mentioned issues. # The number of constraints created is the same as the number of filters. No combinatorial creation of constraints. No need for EqualNullSafe constraints on aliases. # Can remove redundant predicates on any expression involving aliases, irrespective of the number of repeat occurrences, in all possible combinations. # Brings query compilation time down from hours to a few minutes. # Can push compound predicates on Joins & infer right number of IsNotNull constraints
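The core idea in Q4 — one stored constraint per filter, with aliases canonicalized to a base attribute before comparison instead of expanding every alias combination — can be sketched as a toy model. The string-based predicate representation below is invented purely for illustration and is not the proposed implementation:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy sketch: predicates are plain strings and every alias maps back to one
// base attribute, so equivalent filters collapse into a single stored
// constraint instead of expanding combinatorially over all aliases.
public class ConstraintStore {
    private final Map<String, String> aliasToBase = new HashMap<>();
    private final Set<String> constraints = new HashSet<>();

    void addAlias(String alias, String base) {
        aliasToBase.put(alias, base);
    }

    /** Rewrite every aliased attribute token to its base name. */
    String canonicalize(String predicate) {
        String result = predicate;
        for (Map.Entry<String, String> e : aliasToBase.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }

    /** Returns true if the filter adds a new constraint; false if redundant. */
    boolean addFilter(String predicate) {
        return constraints.add(canonicalize(predicate));
    }

    int size() {
        return constraints.size();
    }
}
```

The invariant the SPIP claims falls out directly: the store never holds more constraints than filters seen, and a filter on an alias is recognized as redundant no matter how many aliases exist.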
[jira] [Updated] (SPARK-37060) Report driver status does not handle response from backup masters
[ https://issues.apache.org/jira/browse/SPARK-37060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohamadreza Rostami updated SPARK-37060: Description: After an improvement in SPARK-31486, the contributor uses the 'asyncSendToMasterAndForwardReply' method instead of 'activeMasterEndpoint.askSync' to get the status of the driver. Since the driver's status is only available on the active master and the 'asyncSendToMasterAndForwardReply' method iterates over all of the masters, we have to handle the responses from the backup masters in the client, which was not considered in the SPARK-31486 change. So drivers running in cluster mode on a multi-master cluster are affected by this bug. I have created a patch for this bug and will send a pull request soon. (was: After a improvement in SPARK-31486, contributor uses 'asyncSendToMasterAndForwardReply' method instead of 'activeMasterEndpoint.askSync' to get the status of driver. Since the driver's status is only available inactive master and the 'asyncSendToMasterAndForwardReply' method iterate over all of the masters, we have to handle the response from the backup masters in the client, which the developer did not consider in the SPARK-31486 change. So drivers running in cluster mode and on a cluster with multi-master affected by this bug. I created the patch for this bug and will soon be sent the pull request.) > Report driver status does not handle response from backup masters > - > > Key: SPARK-37060 > URL: https://issues.apache.org/jira/browse/SPARK-37060 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0 >Reporter: Mohamadreza Rostami >Priority: Major > > After an improvement in SPARK-31486, the contributor uses the > 'asyncSendToMasterAndForwardReply' method instead of > 'activeMasterEndpoint.askSync' to get the status of the driver. Since the > driver's status is only available on the active master and the > 'asyncSendToMasterAndForwardReply' method iterates over all of the masters, we > have to handle the responses from the backup masters in the client, which was > not considered in the SPARK-31486 change. So drivers running in > cluster mode on a multi-master cluster are affected by this bug. I have > created a patch for this bug and will send a pull request soon. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
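The fix described for SPARK-37060 amounts to filtering replies by master state before trusting them: when a status request is forwarded to every master, only the active master's answer is authoritative. A toy model of that client-side handling (the types below are invented for illustration and do not mirror Spark's actual RPC classes):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only: one reply arrives per master, but replies from
// standby (backup) masters carry no driver status and must be ignored.
public class DriverStatusClient {
    static final class StatusReply {
        final boolean fromActiveMaster;
        final String driverStatus;  // null when the replying master is standby
        StatusReply(boolean fromActiveMaster, String driverStatus) {
            this.fromActiveMaster = fromActiveMaster;
            this.driverStatus = driverStatus;
        }
    }

    /** Return the active master's answer, or null if no master replied as active. */
    static String resolve(List<StatusReply> replies) {
        for (StatusReply r : replies) {
            if (r.fromActiveMaster) {
                return r.driverStatus;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<StatusReply> replies = new ArrayList<>();
        replies.add(new StatusReply(false, null));      // backup master: ignore
        replies.add(new StatusReply(true, "RUNNING"));  // active master: trust
        System.out.println(resolve(replies));           // RUNNING
    }
}
```

The bug was the absence of this filtering step: a standby master's reply could be treated as the answer, even though only the active master tracks the driver.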
[jira] [Updated] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours
[ https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-33152: -
[jira] [Assigned] (SPARK-37060) Report driver status does not handle response from backup masters
[ https://issues.apache.org/jira/browse/SPARK-37060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37060: Assignee: (was: Apache Spark) > Report driver status does not handle response from backup masters > - > > Key: SPARK-37060 > URL: https://issues.apache.org/jira/browse/SPARK-37060 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0 >Reporter: Mohamadreza Rostami >Priority: Major > > After a improvement in SPARK-31486, contributor uses > 'asyncSendToMasterAndForwardReply' method instead of > 'activeMasterEndpoint.askSync' to get the status of driver. Since the > driver's status is only available inactive master and the > 'asyncSendToMasterAndForwardReply' method iterate over all of the masters, we > have to handle the response from the backup masters in the client, which the > developer did not consider in the SPARK-31486 change. So drivers running in > cluster mode and on a cluster with multi-master affected by this bug. I > created the patch for this bug and will soon be sent the pull request. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37060) Report driver status does not handle response from backup masters
[ https://issues.apache.org/jira/browse/SPARK-37060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430613#comment-17430613 ] Apache Spark commented on SPARK-37060: -- User 'mohamadrezarostami' has created a pull request for this issue: https://github.com/apache/spark/pull/34331 > Report driver status does not handle response from backup masters > - > > Key: SPARK-37060 > URL: https://issues.apache.org/jira/browse/SPARK-37060 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0 >Reporter: Mohamadreza Rostami >Priority: Major > > After a improvement in SPARK-31486, contributor uses > 'asyncSendToMasterAndForwardReply' method instead of > 'activeMasterEndpoint.askSync' to get the status of driver. Since the > driver's status is only available inactive master and the > 'asyncSendToMasterAndForwardReply' method iterate over all of the masters, we > have to handle the response from the backup masters in the client, which the > developer did not consider in the SPARK-31486 change. So drivers running in > cluster mode and on a cluster with multi-master affected by this bug. I > created the patch for this bug and will soon be sent the pull request. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37060) Report driver status does not handle response from backup masters
[ https://issues.apache.org/jira/browse/SPARK-37060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37060: Assignee: Apache Spark > Report driver status does not handle response from backup masters > - > > Key: SPARK-37060 > URL: https://issues.apache.org/jira/browse/SPARK-37060 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0 >Reporter: Mohamadreza Rostami >Assignee: Apache Spark >Priority: Major > > After a improvement in SPARK-31486, contributor uses > 'asyncSendToMasterAndForwardReply' method instead of > 'activeMasterEndpoint.askSync' to get the status of driver. Since the > driver's status is only available inactive master and the > 'asyncSendToMasterAndForwardReply' method iterate over all of the masters, we > have to handle the response from the backup masters in the client, which the > developer did not consider in the SPARK-31486 change. So drivers running in > cluster mode and on a cluster with multi-master affected by this bug. I > created the patch for this bug and will soon be sent the pull request. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37060) Report driver status does not handle response from backup masters
[ https://issues.apache.org/jira/browse/SPARK-37060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohamadreza Rostami updated SPARK-37060: Description: After an improvement in SPARK-31486, a contributor uses the 'asyncSendToMasterAndForwardReply' method instead of 'activeMasterEndpoint.askSync' to get the status of the driver. Since the driver's status is only available in the active master and the 'asyncSendToMasterAndForwardReply' method iterates over all of the masters, we have to handle the responses from the backup masters in the client, which the developer did not consider in the SPARK-31486 change. So drivers running in cluster mode on a cluster with multiple masters are affected by this bug. I created a patch for this bug and will soon send the pull request. (was: After an improvement in SPARK-31486, a contributor uses the 'asyncSendToMasterAndForwardReply' method instead of 'activeMasterEndpoint.askSync' to get the status of the driver. Since the driver's status is only available in the active master and the 'asyncSendToMasterAndForwardReply' method iterates over all of the masters, we have to handle the responses from the backup masters in the client, which the developer did not consider in the SPARK-31486 change. So drivers running in cluster mode on a cluster with multiple masters are affected by this bug.) > Report driver status does not handle response from backup masters > - > > Key: SPARK-37060 > URL: https://issues.apache.org/jira/browse/SPARK-37060 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0 >Reporter: Mohamadreza Rostami >Priority: Major > > After a improvement in SPARK-31486, contributor uses > 'asyncSendToMasterAndForwardReply' method instead of > 'activeMasterEndpoint.askSync' to get the status of driver. 
Since the > driver's status is only available inactive master and the > 'asyncSendToMasterAndForwardReply' method iterate over all of the masters, we > have to handle the response from the backup masters in the client, which the > developer did not consider in the SPARK-31486 change. So drivers running in > cluster mode and on a cluster with multi-master affected by this bug. I > created the patch for this bug and will soon be sent the pull request. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37060) Report driver status does not handle response from backup masters
Mohamadreza Rostami created SPARK-37060: --- Summary: Report driver status does not handle response from backup masters Key: SPARK-37060 URL: https://issues.apache.org/jira/browse/SPARK-37060 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.2.0, 3.1.2, 3.1.1, 3.1.0 Reporter: Mohamadreza Rostami After an improvement in SPARK-31486, a contributor uses the 'asyncSendToMasterAndForwardReply' method instead of 'activeMasterEndpoint.askSync' to get the status of the driver. Since the driver's status is only available in the active master and the 'asyncSendToMasterAndForwardReply' method iterates over all of the masters, we have to handle the responses from the backup masters in the client, which the developer did not consider in the SPARK-31486 change. So drivers running in cluster mode on a cluster with multiple masters are affected by this bug. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
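The fix the reporter describes can be modeled in a few lines. This is a hedged sketch in plain Python, not Spark's actual RPC API (the class and function names here are illustrative): when the status request is forwarded to every master, only the ALIVE master's reply carries the driver status, so the client must skip replies from standby masters instead of acting on whichever response arrives.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DriverStatusResponse:
    """Illustrative stand-in for a master's reply to a status query."""
    master_state: str            # "ALIVE" or "STANDBY"
    found: bool
    state: Optional[str] = None  # driver state, only set by the active master

def merge_responses(responses):
    """Pick the answer from the active master; standby replies are ignored
    rather than treated as the authoritative (and empty) answer."""
    for resp in responses:
        if resp.master_state == "ALIVE":
            return resp
    raise RuntimeError("no active master replied")

replies = [
    DriverStatusResponse("STANDBY", found=False),
    DriverStatusResponse("ALIVE", found=True, state="RUNNING"),
]
print(merge_responses(replies).state)  # RUNNING
```

Without the standby filter, a client that consumes the first reply would conclude the driver does not exist whenever a backup master answers first, which matches the symptom described in this issue.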
[jira] [Resolved] (SPARK-36796) Make all unit tests pass on Java 17
[ https://issues.apache.org/jira/browse/SPARK-36796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-36796. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34153 [https://github.com/apache/spark/pull/34153] > Make all unit tests pass on Java 17 > --- > > Key: SPARK-36796 > URL: https://issues.apache.org/jira/browse/SPARK-36796 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Assignee: Yang Jie >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36796) Make all unit tests pass on Java 17
[ https://issues.apache.org/jira/browse/SPARK-36796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-36796: Assignee: Yang Jie > Make all unit tests pass on Java 17 > --- > > Key: SPARK-36796 > URL: https://issues.apache.org/jira/browse/SPARK-36796 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Assignee: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37059) Ensure the sort order of the output in the PySpark doctests
[ https://issues.apache.org/jira/browse/SPARK-37059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37059: Assignee: Apache Spark (was: Kousuke Saruta) > Ensure the sort order of the output in the PySpark doctests > --- > > Key: SPARK-37059 > URL: https://issues.apache.org/jira/browse/SPARK-37059 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Minor > > The collect_set builtin function doesn't ensure the sort order of its result > for each row. FPGrouthModel.freqItemsets also doesn't ensure the sort order > of the result rows. > Nevertheless, their PySpark doctests assume a certain kind of sort order, > causing that such doctests fail with Scala 2.13. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
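The doctest-hardening pattern behind this fix can be shown with a minimal sketch in plain Python (the helper below is illustrative, not the PySpark API): any set-valued result, like collect_set's output, has no guaranteed element order, so the doctest must sort the result before comparing it against expected output.

```python
def collect_set_like(values):
    """Stand-in for an aggregate that returns distinct values in no
    guaranteed order (illustrative, not PySpark's collect_set)."""
    return set(values)

# Fragile doctest style: list(collect_set_like(...)) may print in any
# order, so the expected output only matches by accident on some builds
# (the issue reports failures with Scala 2.13).
# Robust style: pin the order explicitly before comparing.
result = sorted(collect_set_like(["b", "a", "b", "c"]))
print(result)  # ['a', 'b', 'c']
```

In the actual PySpark doctests the same idea can be applied by wrapping the aggregate in `sort_array(collect_set(...))` or by sorting the collected rows on the driver before printing them.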
[jira] [Commented] (SPARK-37059) Ensure the sort order of the output in the PySpark doctests
[ https://issues.apache.org/jira/browse/SPARK-37059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430551#comment-17430551 ] Apache Spark commented on SPARK-37059: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/34330 > Ensure the sort order of the output in the PySpark doctests > --- > > Key: SPARK-37059 > URL: https://issues.apache.org/jira/browse/SPARK-37059 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > The collect_set builtin function doesn't ensure the sort order of its result > for each row. FPGrouthModel.freqItemsets also doesn't ensure the sort order > of the result rows. > Nevertheless, their PySpark doctests assume a certain kind of sort order, > causing that such doctests fail with Scala 2.13. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37059) Ensure the sort order of the output in the PySpark doctests
[ https://issues.apache.org/jira/browse/SPARK-37059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37059: Assignee: Kousuke Saruta (was: Apache Spark) > Ensure the sort order of the output in the PySpark doctests > --- > > Key: SPARK-37059 > URL: https://issues.apache.org/jira/browse/SPARK-37059 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > The collect_set builtin function doesn't ensure the sort order of its result > for each row. FPGrouthModel.freqItemsets also doesn't ensure the sort order > of the result rows. > Nevertheless, their PySpark doctests assume a certain kind of sort order, > causing that such doctests fail with Scala 2.13. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37059) Ensure the sort order of the output in the PySpark doctests
[ https://issues.apache.org/jira/browse/SPARK-37059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-37059: --- Description: The collect_set builtin function doesn't ensure the sort order of its result for each row. FPGrouthModel.freqItemsets also doesn't ensure the sort order of the result rows. Nevertheless, their PySpark doctests assume a certain kind of sort order, causing that such doctests fail with Scala 2.13. was: The collect_set builtin function doesn't ensure the sort order of its result for each row. FPGrouthModel.freqItemsets also doesn't ensure the sort order of the result rows. Nevertheless, their doctests for PySpark assume a certain kind of sort order, causing that such doctests fail with Scala 2.13. > Ensure the sort order of the output in the PySpark doctests > --- > > Key: SPARK-37059 > URL: https://issues.apache.org/jira/browse/SPARK-37059 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > The collect_set builtin function doesn't ensure the sort order of its result > for each row. FPGrouthModel.freqItemsets also doesn't ensure the sort order > of the result rows. > Nevertheless, their PySpark doctests assume a certain kind of sort order, > causing that such doctests fail with Scala 2.13. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37059) Ensure the sort order of the output in the PySpark doctests
[ https://issues.apache.org/jira/browse/SPARK-37059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-37059: --- Description: The collect_set builtin function doesn't ensure the sort order of its result for each row. FPGrouthModel.freqItemsets also doesn't ensure the sort order of the result rows. Nevertheless, thier doctests for PySpark assume a certain kind of sort order, causing that such doctests fail with Scala 2.13. was: The collect_set builtin function doesn't ensure the sort order of its result for each row. FPGrouthModel.freqItemsets also doesn' ensure the sort order of the result rows. Nevertheless, thier doctests for PySpark assume a certain kind of sort order, causing that such doctests fail with Scala 2.13. > Ensure the sort order of the output in the PySpark doctests > --- > > Key: SPARK-37059 > URL: https://issues.apache.org/jira/browse/SPARK-37059 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > The collect_set builtin function doesn't ensure the sort order of its result > for each row. FPGrouthModel.freqItemsets also doesn't ensure the sort order > of the result rows. > Nevertheless, thier doctests for PySpark assume a certain kind of sort order, > causing that such doctests fail with Scala 2.13. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37059) Ensure the sort order of the output in the PySpark doctests
[ https://issues.apache.org/jira/browse/SPARK-37059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-37059: --- Description: The collect_set builtin function doesn't ensure the sort order of its result for each row. FPGrouthModel.freqItemsets also doesn't ensure the sort order of the result rows. Nevertheless, their doctests for PySpark assume a certain kind of sort order, causing that such doctests fail with Scala 2.13. was: The collect_set builtin function doesn't ensure the sort order of its result for each row. FPGrouthModel.freqItemsets also doesn't ensure the sort order of the result rows. Nevertheless, thier doctests for PySpark assume a certain kind of sort order, causing that such doctests fail with Scala 2.13. > Ensure the sort order of the output in the PySpark doctests > --- > > Key: SPARK-37059 > URL: https://issues.apache.org/jira/browse/SPARK-37059 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > The collect_set builtin function doesn't ensure the sort order of its result > for each row. FPGrouthModel.freqItemsets also doesn't ensure the sort order > of the result rows. > Nevertheless, their doctests for PySpark assume a certain kind of sort order, > causing that such doctests fail with Scala 2.13. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37059) Ensure the sort order of the output in the PySpark doctests
[ https://issues.apache.org/jira/browse/SPARK-37059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-37059: --- Description: The collect_set builtin function doesn't ensure the sort order of its result for each row. FPGrouthModel.freqItemsets also doesn' ensure the sort order of the result rows. Nevertheless, thier doctests for PySpark assume a certain kind of sort order, causing that such doctests fail with Scala 2.13. was: The collect_set builtin function doesn't ensure the sort order of its result. FPGrouthModel.freqItemsets also doesn' ensure the sort order of the result rows. Nevertheless, thier doctests for PySpark assume a certain kind of sort order, causing that such doctests fail with Scala 2.13. > Ensure the sort order of the output in the PySpark doctests > --- > > Key: SPARK-37059 > URL: https://issues.apache.org/jira/browse/SPARK-37059 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > The collect_set builtin function doesn't ensure the sort order of its result > for each row. FPGrouthModel.freqItemsets also doesn' ensure the sort order of > the result rows. > Nevertheless, thier doctests for PySpark assume a certain kind of sort order, > causing that such doctests fail with Scala 2.13. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37059) Ensure the sort order of the output in the PySpark doctests
[ https://issues.apache.org/jira/browse/SPARK-37059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-37059: --- Component/s: Tests > Ensure the sort order of the output in the PySpark doctests > --- > > Key: SPARK-37059 > URL: https://issues.apache.org/jira/browse/SPARK-37059 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > The collect_set builtin function doesn't ensure the sort order of its result. > FPGrouthModel.freqItemsets also doesn' ensure the sort order of the result > rows. > Nevertheless, thier doctests for PySpark assume a certain kind of sort order, > causing that such doctests fail with Scala 2.13. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37059) Ensure the sort order of the output in the PySpark doctests
[ https://issues.apache.org/jira/browse/SPARK-37059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-37059: --- Summary: Ensure the sort order of the output in the PySpark doctests (was: Ensure the sort order of the output in the PySpark examples) > Ensure the sort order of the output in the PySpark doctests > --- > > Key: SPARK-37059 > URL: https://issues.apache.org/jira/browse/SPARK-37059 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > The collect_set builtin function doesn't ensure the sort order of its result. > FPGrouthModel.freqItemsets also doesn' ensure the sort order of the result > rows. > Nevertheless, thier doctests for PySpark assume a certain kind of sort order, > causing that such doctests fail with Scala 2.13. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37059) Ensure the sort order of the output in the PySpark examples
Kousuke Saruta created SPARK-37059: -- Summary: Ensure the sort order of the output in the PySpark examples Key: SPARK-37059 URL: https://issues.apache.org/jira/browse/SPARK-37059 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.3.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta The collect_set builtin function doesn't ensure the sort order of its result. FPGrouthModel.freqItemsets also doesn' ensure the sort order of the result rows. Nevertheless, thier doctests for PySpark assume a certain kind of sort order, causing that such doctests fail with Scala 2.13. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37058) Add spark-shell command line unit test
[ https://issues.apache.org/jira/browse/SPARK-37058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37058: Assignee: (was: Apache Spark) > Add spark-shell command line unit test > -- > > Key: SPARK-37058 > URL: https://issues.apache.org/jira/browse/SPARK-37058 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > Add spark-shell command line unit test -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37058) Add spark-shell command line unit test
[ https://issues.apache.org/jira/browse/SPARK-37058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37058: Assignee: Apache Spark > Add spark-shell command line unit test > -- > > Key: SPARK-37058 > URL: https://issues.apache.org/jira/browse/SPARK-37058 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > > Add spark-shell command line unit test -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37058) Add spark-shell command line unit test
[ https://issues.apache.org/jira/browse/SPARK-37058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430484#comment-17430484 ] Apache Spark commented on SPARK-37058: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/34329 > Add spark-shell command line unit test > -- > > Key: SPARK-37058 > URL: https://issues.apache.org/jira/browse/SPARK-37058 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > Add spark-shell command line unit test -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37051) The filter operator gets wrong results in ORC's char type
[ https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] frankli updated SPARK-37051: Description: When I try the following sample SQL on the TPCDS data, the filter operator returns an empty row set (shown in the web UI). _select * from item where i_category = 'Music' limit 100;_ The table is in ORC format, and i_category is of char(50) type. Data is inserted by Hive and queried by Spark. I guess that the char(50) type retains trailing blanks after the actual word, which affects the boolean value of "x.equals(y)" and results in wrong answers. Luckily, the varchar type is OK. This bug can be reproduced in a few steps.

>>> desc t2_orc;
+-----------+------------+----------+
| col_name  | data_type  | comment  |
+-----------+------------+----------+
| a         | string     | NULL     |
| b         | char(50)   | NULL     |
| c         | int        | NULL     |
+-----------+------------+----------+

>>> select * from t2_orc where a='a';
+----+----+----+
| a  | b  | c  |
+----+----+----+
| a  | b  | 1  |
| a  | b  | 2  |
| a  | b  | 3  |
| a  | b  | 4  |
| a  | b  | 5  |
+----+----+----+

>>> select * from t2_orc where b='b';
+----+----+----+
| a  | b  | c  |
+----+----+----+
+----+----+----+

By the way, Spark's tests should add more cases on the char type. 
== Physical Plan == CollectLimit (3) +- Filter (2) +- Scan orc tpcds_bin_partitioned_orc_2.item (1) (1) Scan orc tpcds_bin_partitioned_orc_2.item Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21] Batched: false Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item] PushedFilters: [IsNotNull(i_category), +EqualTo(i_category,+Music )] ReadSchema: struct (2) Filter Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21] Condition : (isnotnull(i_category#12) AND +(i_category#12 = Music ))+ (3) CollectLimit Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, 
i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21|#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21] Arguments: 100 was: When I try the following sample SQL on the TPCDS data, the filter operator returns an empty row set (shown in web ui). _select * from item where i_category = 'Music' limit 100;_ The table is in ORC format, and i_category is char(50) type. I guest that the char(50) type will remains redundant blanks after the actual word. It will affect the boolean value of "x.equals(Y)", and results in wrong results. Luckily, the varchar type is OK. This bug can be reproduced by a few steps. >>> desc t2_orc; +---++--+--+ | col_name | data_type | comment | +---++--+--+ | a | string | NULL | | b | char(50) | NULL | | c | int | NULL | +---++--+–+ >>> select * from t2_orc where a='a'; ++++--+ | a | b | c | ++++--+ | a | b | 1 | | a | b | 2 | | a | b | 3 | | a | b | 4 | | a | b | 5 | ++++–+ >>> select * from t2_orc where b='b'; ++++--+ | a | b | c | ++++--+ ++++--+ By the way, Spark's tests should add more cases on the
[jira] [Commented] (SPARK-37051) The filter operator gets wrong results in ORC's char type
[ https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430450#comment-17430450 ] frankli commented on SPARK-37051: - It seems to be affected by the right padding. [SPARK-34192][SQL] [https://github.com/apache/spark/commit/d1177b52304217f4cb86506fd1887ec98879ed16] [~yaoqiang] > The filter operator gets wrong results in ORC's char type > - > > Key: SPARK-37051 > URL: https://issues.apache.org/jira/browse/SPARK-37051 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.1, 3.3.0 > Environment: Spark 3.1.2 > Scala 2.12 / Java 1.8 >Reporter: frankli >Priority: Critical > > When I try the following sample SQL on the TPCDS data, the filter operator > returns an empty row set (shown in web ui). > _select * from item where i_category = 'Music' limit 100;_ > The table is in ORC format, and i_category is char(50) type. > I guest that the char(50) type will remains redundant blanks after the actual > word. > It will affect the boolean value of "x.equals(Y)", and results in wrong > results. > Luckily, the varchar type is OK. > > This bug can be reproduced by a few steps. > >>> desc t2_orc; > +---++--+--+ > | col_name | data_type | comment | > +---++--+--+ > | a | string | NULL | > | b | char(50) | NULL | > | c | int | NULL | > +---++--+–+ > >>> select * from t2_orc where a='a'; > ++++--+ > | a | b | c | > ++++--+ > | a | b | 1 | > | a | b | 2 | > | a | b | 3 | > | a | b | 4 | > | a | b | 5 | > ++++–+ > >>> select * from t2_orc where b='b'; > ++++--+ > | a | b | c | > ++++--+ > ++++--+ > > By the way, Spark's tests should add more cases on the char type. 
> == Physical Plan ==
> CollectLimit (3)
> +- Filter (2)
>    +- Scan orc tpcds_bin_partitioned_orc_2.item (1)
>
> (1) Scan orc tpcds_bin_partitioned_orc_2.item
> Output [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Batched: false
> Location: InMemoryFileIndex [hdfs://tpcds_bin_partitioned_orc_2.db/item]
> PushedFilters: [IsNotNull(i_category), EqualTo(i_category,Music                                             )]
> ReadSchema: struct
>
> (2) Filter
> Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Condition : (isnotnull(i_category#12) AND (i_category#12 = Music                                             ))
>
> (3) CollectLimit
> Input [22]: [i_item_sk#0L, i_item_id#1, i_rec_start_date#2, i_rec_end_date#3, i_item_desc#4, i_current_price#5, i_wholesale_cost#6, i_brand_id#7, i_brand#8, i_class_id#9, i_class#10, i_category_id#11, i_category#12, i_manufact_id#13, i_manufact#14, i_size#15, i_formulation#16, i_color#17, i_units#18, i_container#19, i_manager_id#20, i_product_name#21]
> Arguments: 100
[jira] [Created] (SPARK-37058) Add spark-shell command line unit test
angerszhu created SPARK-37058: - Summary: Add spark-shell command line unit test Key: SPARK-37058 URL: https://issues.apache.org/jira/browse/SPARK-37058 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.2.0 Reporter: angerszhu Add spark-shell command line unit test
[jira] [Commented] (SPARK-33564) Prometheus metrics for Master and Worker isn't working
[ https://issues.apache.org/jira/browse/SPARK-33564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430425#comment-17430425 ] Michał Wieleba commented on SPARK-33564: Are there any plans to introduce Prometheus metrics for yarn cluster? Any update on this? > Prometheus metrics for Master and Worker isn't working > --- > > Key: SPARK-33564 > URL: https://issues.apache.org/jira/browse/SPARK-33564 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell >Affects Versions: 3.0.0, 3.0.1 >Reporter: Paulo Roberto de Oliveira Castro >Priority: Major > Labels: Metrics, metrics, prometheus > > Following the [PR|https://github.com/apache/spark/pull/25769] that introduced > the Prometheus sink, I downloaded the {{spark-3.0.1-bin-hadoop2.7.tgz}} > (also tested with 3.0.0), uncompressed the tgz and created a file called > {{metrics.properties}} adding this content: > {quote}{{*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet}} > {{*.sink.prometheusServlet.path=/metrics/prometheus}} > master.sink.prometheusServlet.path=/metrics/master/prometheus > applications.sink.prometheusServlet.path=/metrics/applications/prometheus > {quote} > Then I ran: > {quote}{{$ sbin/start-master.sh}} > {{$ sbin/start-slave.sh spark://`hostname`:7077}} > {{$ bin/spark-shell --master spark://`hostname`:7077 > --files=./metrics.properties --conf spark.metrics.conf=./metrics.properties}} > {quote} > {{The Spark shell opens without problems:}} > {quote}{{20/11/25 17:36:07 WARN NativeCodeLoader: Unable to load > native-hadoop library for your platform... using builtin-java classes where > applicable}} > {{Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties}} > {{Setting default log level to "WARN".}} > {{To adjust logging level use sc.setLogLevel(newLevel). 
For SparkR, use setLogLevel(newLevel).}}
> {{Spark context Web UI available at http://192.168.0.6:4040}}
> {{Spark context available as 'sc' (master = spark://MacBook-Pro-de-Paulo-2.local:7077, app id = app-20201125173618-0002).}}
> {{Spark session available as 'spark'.}}
> {{Welcome to}}
> {{      ____              __}}
> {{     / __/__  ___ _____/ /__}}
> {{    _\ \/ _ \/ _ `/ __/  '_/}}
> {{   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0}}
> {{      /_/}}
> {{Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)}}
> {{Type in expressions to have them evaluated.}}
> {{Type :help for more information.}}
> {{scala>}}
> {quote}
> And when I try to fetch Prometheus metrics for the driver, everything works fine:
> {quote}$ curl -s http://localhost:4040/metrics/prometheus/ | head -n 5
> metrics_app_20201125173618_0002_driver_BlockManager_disk_diskSpaceUsed_MB_Number{type="gauges"} 0
> metrics_app_20201125173618_0002_driver_BlockManager_disk_diskSpaceUsed_MB_Value{type="gauges"} 0
> metrics_app_20201125173618_0002_driver_BlockManager_memory_maxMem_MB_Number{type="gauges"} 732
> metrics_app_20201125173618_0002_driver_BlockManager_memory_maxMem_MB_Value{type="gauges"} 732
> metrics_app_20201125173618_0002_driver_BlockManager_memory_maxOffHeapMem_MB_Number{type="gauges"} 0
> {quote}
> *The problem appears when I try accessing master metrics*, and I get the following response:
> {quote}$ curl -s http://localhost:8080/metrics/master/prometheus
> (the response is the Master web UI HTML page, stylesheet and script tags plus "Spark Master at spark://MacBook-Pro-de-Paulo-2.local:7077", rather than metrics)
> {quote}
> Instead of the metrics I'm getting an HTML page. The same happens for all of these:
> {quote}$ curl -s http://localhost:8080/metrics/applications/prometheus/
> $ curl -s
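For reference, the sink configuration the reporter describes, restated without the wiki markup that clutters the quote above (these property keys and the PrometheusServlet class are taken directly from the report):

```properties
# Enable the built-in PrometheusServlet sink for every metrics instance
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
# Instance-specific paths for the standalone master and the applications source
master.sink.prometheusServlet.path=/metrics/master/prometheus
applications.sink.prometheusServlet.path=/metrics/applications/prometheus
```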
[jira] [Resolved] (SPARK-37057) Fix wrong DocSearch facet filter in release-tag.sh
[ https://issues.apache.org/jira/browse/SPARK-37057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-37057. Fix Version/s: 3.2.1 3.3.0 Resolution: Fixed Issue resolved by pull request 34328 [https://github.com/apache/spark/pull/34328] > Fix wrong DocSearch facet filter in release-tag.sh > -- > > Key: SPARK-37057 > URL: https://issues.apache.org/jira/browse/SPARK-37057 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.2.0, 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > Fix For: 3.3.0, 3.2.1 > > > In release-tag.sh, the DocSearch facet filter should be updated to the release version before creating the git tag. > If this step is missed, the facet filter is wrong in the new release docs: > https://github.com/apache/spark/blame/v3.2.0/docs/_config.yml#L42
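The fix described above amounts to substituting the release version into the DocSearch facet filter in docs/_config.yml before the tag is created. A minimal shell sketch of that kind of substitution (the exact layout of the facet filter in _config.yml is assumed here, and RELEASE_VERSION stands in for the release script's own version variable):

```shell
#!/bin/sh
# Hypothetical sketch: rewrite the DocSearch facet filter version in docs/_config.yml
# before running `git tag`. Assumes the filter appears as "version:X.Y.Z" in the file.
RELEASE_VERSION="3.2.1"
sed -i.bak "s/version:[0-9][0-9a-zA-Z.-]*/version:${RELEASE_VERSION}/" docs/_config.yml
rm -f docs/_config.yml.bak
```

The `-i.bak` form keeps the edit portable across GNU and BSD sed; the backup file is removed once the substitution succeeds.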
[jira] [Updated] (SPARK-37051) The filter operator gets wrong results in ORC's char type
[ https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] frankli updated SPARK-37051: Affects Version/s: 3.3.0
> The filter operator gets wrong results in ORC's char type
> ---------------------------------------------------------
>
> Key: SPARK-37051
> URL: https://issues.apache.org/jira/browse/SPARK-37051
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.2, 3.2.1, 3.3.0
> Environment: Spark 3.1.2
> Scala 2.12 / Java 1.8
> Reporter: frankli
> Priority: Critical
>
> When I try the following sample SQL on the TPCDS data, the filter operator returns an empty row set (shown in the web UI).
> _select * from item where i_category = 'Music' limit 100;_
> The table is in ORC format, and i_category is of char(50) type.
> I guess that the char(50) type keeps redundant blanks after the actual word.
> This affects the boolean value of "x.equals(y)" and leads to wrong results.
> Luckily, the varchar type is OK.
>
> This bug can be reproduced in a few steps.
> >>> desc t2_orc;
> +-----------+------------+----------+
> | col_name  | data_type  | comment  |
> +-----------+------------+----------+
> | a         | string     | NULL     |
> | b         | char(50)   | NULL     |
> | c         | int        | NULL     |
> +-----------+------------+----------+
> >>> select * from t2_orc where a='a';
> +----+----+----+
> | a  | b  | c  |
> +----+----+----+
> | a  | b  | 1  |
> | a  | b  | 2  |
> | a  | b  | 3  |
> | a  | b  | 4  |
> | a  | b  | 5  |
> +----+----+----+
> >>> select * from t2_orc where b='b';
> +----+----+----+
> | a  | b  | c  |
> +----+----+----+
> +----+----+----+
>
> By the way, Spark's tests should add more cases on the char type.
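The padding behavior SPARK-37051 describes can be illustrated outside Spark: a CHAR(50) value is stored right-padded with spaces to 50 characters, so a plain equality test against the unpadded literal fails. A minimal Python sketch of the comparison semantics, not Spark code:

```python
# CHAR(50) semantics: values are right-padded with spaces to the declared length.
stored = "Music".ljust(50)   # what an ORC char(50) column hands back
literal = "Music"            # what the query's predicate compares against

# Plain equality fails because of the trailing blanks ...
assert stored != literal
assert len(stored) == 50

# ... while comparing after stripping the padding succeeds, which is what a
# char-aware read path has to account for when pushing down the filter.
assert stored.rstrip() == literal
```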
[jira] [Commented] (SPARK-36348) unexpected Index loaded: pd.Index([10, 20, None], name="x")
[ https://issues.apache.org/jira/browse/SPARK-36348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430392#comment-17430392 ] Yikun Jiang commented on SPARK-36348: - Revisiting this: it has already been fixed in the current master branch; I will add a simple PR to optimize the test case.
{code:python}
pidx = pd.Index([10, 20, 15, 30, 45, None], name="x")
psidx = ps.Index(pidx)
self.assert_eq(psidx.astype(bool), pidx.astype(bool))
self.assert_eq(psidx.astype(str), pidx.astype(str))
{code}
> unexpected Index loaded: pd.Index([10, 20, None], name="x")
> -----------------------------------------------------------
>
> Key: SPARK-36348
> URL: https://issues.apache.org/jira/browse/SPARK-36348
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.2.0
> Reporter: Yikun Jiang
> Priority: Major
>
> {code:python}
> pidx = pd.Index([10, 20, 15, 30, 45, None], name="x")
> psidx = ps.Index(pidx)
> self.assert_eq(psidx.astype(str), pidx.astype(str))
> {code}
> [left, pandas-on-Spark]: Index(['10.0', '20.0', '15.0', '30.0', '45.0', 'nan'], dtype='object', name='x')
> [right, pandas]: Index(['10', '20', '15', '30', '45', 'None'], dtype='object', name='x')
> The index is loaded as float64, so follow-up steps like astype differ from pandas.
> [1] https://github.com/apache/spark/blob/bcc595c112a23d8e3024ace50f0dbc7eab7144b2/python/pyspark/pandas/tests/indexes/test_base.py#L2249
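The dtype widening at the root of this report is reproducible in plain pandas, without Spark: a None among integers cannot live in an int64 index, so pandas falls back to float64, and a later astype(str) stringifies the floats ('10.0' rather than '10'). A small sketch:

```python
import pandas as pd

# None cannot be represented in an int64 index, so pandas widens to float64
pidx = pd.Index([10, 20, None], name="x")
print(pidx.dtype)                 # float64

# astype(str) therefore yields the float spellings, with None becoming 'nan'
print(pidx.astype(str).tolist())  # ['10.0', '20.0', 'nan']
```

This is why the pandas-on-Spark side of the test produced '10.0'/'nan' strings: the mismatch came from the float64 load, not from astype itself.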
[jira] [Updated] (SPARK-37051) The filter operator gets wrong results in ORC's char type
[ https://issues.apache.org/jira/browse/SPARK-37051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] frankli updated SPARK-37051: Affects Version/s: 3.2.1
> The filter operator gets wrong results in ORC's char type
> ---------------------------------------------------------
>
> Key: SPARK-37051
> URL: https://issues.apache.org/jira/browse/SPARK-37051
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.2, 3.2.1
> Environment: Spark 3.1.2
> Scala 2.12 / Java 1.8
> Reporter: frankli
> Priority: Critical
>
> When I try the following sample SQL on the TPCDS data, the filter operator returns an empty row set (shown in the web UI).
> _select * from item where i_category = 'Music' limit 100;_
> The table is in ORC format, and i_category is of char(50) type.
> I guess that the char(50) type keeps redundant blanks after the actual word.
> This affects the boolean value of "x.equals(y)" and leads to wrong results.
> Luckily, the varchar type is OK.
>
> This bug can be reproduced in a few steps.
> >>> desc t2_orc;
> +-----------+------------+----------+
> | col_name  | data_type  | comment  |
> +-----------+------------+----------+
> | a         | string     | NULL     |
> | b         | char(50)   | NULL     |
> | c         | int        | NULL     |
> +-----------+------------+----------+
> >>> select * from t2_orc where a='a';
> +----+----+----+
> | a  | b  | c  |
> +----+----+----+
> | a  | b  | 1  |
> | a  | b  | 2  |
> | a  | b  | 3  |
> | a  | b  | 4  |
> | a  | b  | 5  |
> +----+----+----+
> >>> select * from t2_orc where b='b';
> +----+----+----+
> | a  | b  | c  |
> +----+----+----+
> +----+----+----+
>
> By the way, Spark's tests should add more cases on the char type.
[jira] [Assigned] (SPARK-37057) Fix wrong DocSearch facet filter in release-tag.sh
[ https://issues.apache.org/jira/browse/SPARK-37057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37057: Assignee: Gengliang Wang (was: Apache Spark) > Fix wrong DocSearch facet filter in release-tag.sh > -- > > Key: SPARK-37057 > URL: https://issues.apache.org/jira/browse/SPARK-37057 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.2.0, 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > > In release-tag.sh, the DocSearch facet filter should be updated to the release version before creating the git tag. > If this step is missed, the facet filter is wrong in the new release docs: > https://github.com/apache/spark/blame/v3.2.0/docs/_config.yml#L42
[jira] [Commented] (SPARK-37057) Fix wrong DocSearch facet filter in release-tag.sh
[ https://issues.apache.org/jira/browse/SPARK-37057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17430368#comment-17430368 ] Apache Spark commented on SPARK-37057: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/34328 > Fix wrong DocSearch facet filter in release-tag.sh > -- > > Key: SPARK-37057 > URL: https://issues.apache.org/jira/browse/SPARK-37057 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.2.0, 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > > In release-tag.sh, the DocSearch facet filter should be updated to the release version before creating the git tag. > If this step is missed, the facet filter is wrong in the new release docs: > https://github.com/apache/spark/blame/v3.2.0/docs/_config.yml#L42