[jira] [Commented] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165653#comment-17165653 ]

Apache Spark commented on SPARK-29302:
User 'WinkerDu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29260

> dynamic partition overwrite with speculation enabled
>
> Key: SPARK-29302
> URL: https://issues.apache.org/jira/browse/SPARK-29302
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.4
> Reporter: feiwang
> Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
> Now, for a dynamic partition overwrite operation, the filename of a task output is deterministic.
> So, if speculation is enabled, could a task conflict with its corresponding speculative task?
> Could the two tasks concurrently write to the same file?

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
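The conflict the reporter describes can be illustrated with a small sketch (plain Python, not Spark's actual code; the `staging_path` helper and its naming scheme are hypothetical): when the staging file name is derived only from the partition and the task's split id, a task and its speculative copy target the same path, whereas including the attempt id would keep them apart.

```python
def staging_path(partition: str, split_id: int, attempt_id: int,
                 include_attempt: bool) -> str:
    """Hypothetical staging-file naming sketch for a dynamic partition overwrite."""
    name = f"part-{split_id:05d}"
    if include_attempt:
        # Disambiguating by attempt id avoids the collision.
        name += f"-attempt-{attempt_id}"
    return f".spark-staging/{partition}/{name}.parquet"

original = staging_path("dt=2019-02-17", 200, 0, include_attempt=False)
speculative = staging_path("dt=2019-02-17", 200, 1, include_attempt=False)
collision = original == speculative  # both attempts write the same file
```

With attempt ids left out of the name, `collision` is true, which is exactly the concurrent-write scenario raised in the issue.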
[jira] [Assigned] (SPARK-32457) logParam thresholds in DT/GBT/FM/LR/MLP
[ https://issues.apache.org/jira/browse/SPARK-32457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-32457:
Assignee: Apache Spark

> logParam thresholds in DT/GBT/FM/LR/MLP
>
> Key: SPARK-32457
> URL: https://issues.apache.org/jira/browse/SPARK-32457
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 3.1.0
> Reporter: zhengruifeng
> Assignee: Apache Spark
> Priority: Trivial
>
> The param thresholds is logged in NB/RF, but not in the other ProbabilisticClassifiers.
[jira] [Assigned] (SPARK-30794) Stage Level scheduling: Add ability to set off heap memory
[ https://issues.apache.org/jira/browse/SPARK-30794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves reassigned SPARK-30794:
Assignee: Zhongwei Zhu

> Stage Level scheduling: Add ability to set off heap memory
>
> Key: SPARK-30794
> URL: https://issues.apache.org/jira/browse/SPARK-30794
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.1.0
> Reporter: Thomas Graves
> Assignee: Zhongwei Zhu
> Priority: Major
>
> For stage level scheduling in ExecutorResourceRequests we support setting heap memory, pyspark memory, and memory overhead. We have not split out off-heap memory as its own configuration, so we should add it as an option.
[jira] [Resolved] (SPARK-30794) Stage Level scheduling: Add ability to set off heap memory
[ https://issues.apache.org/jira/browse/SPARK-30794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves resolved SPARK-30794.
Fix Version/s: 3.1.0
Resolution: Fixed

> Stage Level scheduling: Add ability to set off heap memory
>
> Key: SPARK-30794
> URL: https://issues.apache.org/jira/browse/SPARK-30794
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.1.0
> Reporter: Thomas Graves
> Assignee: Zhongwei Zhu
> Priority: Major
> Fix For: 3.1.0
>
> For stage level scheduling in ExecutorResourceRequests we support setting heap memory, pyspark memory, and memory overhead. We have not split out off-heap memory as its own configuration, so we should add it as an option.
[jira] [Commented] (SPARK-29918) RecordBinaryComparator should check endianness when compared by long
[ https://issues.apache.org/jira/browse/SPARK-29918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165611#comment-17165611 ]

Apache Spark commented on SPARK-29918:
User 'mundaym' has created a pull request for this issue:
https://github.com/apache/spark/pull/29259

> RecordBinaryComparator should check endianness when compared by long
>
> Key: SPARK-29918
> URL: https://issues.apache.org/jira/browse/SPARK-29918
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
> Reporter: EdisonWang
> Assignee: EdisonWang
> Priority: Minor
> Labels: correctness
> Fix For: 2.4.5, 3.0.0
>
> If the architecture supports unaligned access or the offset is 8-byte aligned, RecordBinaryComparator compares 8 bytes at a time by reading them as a long. Otherwise, it compares byte by byte.
> However, on a little-endian machine, comparing by a long value and comparing byte by byte may give different results. If the architectures in a yarn cluster differ (some are unaligned-access capable while others are not), then the order of two records after sorting is undetermined, which results in the same problem as in https://issues.apache.org/jira/browse/SPARK-23207
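The discrepancy the issue describes can be demonstrated without Spark at all. A minimal sketch in plain Python: compare the same 8 bytes lexicographically (byte by byte), then as integers under little-endian and big-endian interpretations; only the big-endian interpretation agrees with the byte-by-byte order.

```python
# Two 8-byte records that sort differently depending on how they are compared.
a = bytes([0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00])
b = bytes([0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x02])

# Lexicographic (byte-by-byte) comparison: a > b, since a[0] = 0x01 > b[0] = 0x00.
byte_by_byte = a > b

# Interpreted as little-endian longs: a is 1, b is 2**57, so a > b is False.
as_le_long = int.from_bytes(a, "little") > int.from_bytes(b, "little")

# Interpreted as big-endian longs: a is 2**56, b is 2 -- this matches the
# lexicographic order, which is why only little-endian machines need a byte swap.
as_be_long = int.from_bytes(a, "big") > int.from_bytes(b, "big")
```

This is the core of the bug: a comparator that mixes long-at-a-time and byte-at-a-time comparisons across machines of different endianness or alignment capability can produce two different total orders for the same data.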
[jira] [Created] (SPARK-32459) UDF regression of WrappedArray supporting caused by SPARK-31826
wuyi created SPARK-32459:

Summary: UDF regression of WrappedArray supporting caused by SPARK-31826
Key: SPARK-32459
URL: https://issues.apache.org/jira/browse/SPARK-32459
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.1.0
Reporter: wuyi

{code:java}
test("WrappedArray") {
  val myUdf = udf((a: WrappedArray[Int]) =>
    WrappedArray.make[Int](Array(a.head + 99)))
  checkAnswer(Seq(Array(1))
    .toDF("col")
    .select(myUdf(Column("col"))),
    Row(ArrayBuffer(100)))
}
{code}

Executing the above test on the master branch hits this error:

{code:java}
[info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, 192.168.101.3, executor driver): java.lang.RuntimeException: Error while decoding: java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer cannot be cast to scala.collection.mutable.WrappedArray
[info]
{code}

However, the test executes successfully on branch-3.0. This is actually a regression caused by SPARK-31826. The regression appeared after we changed the catalyst-to-Scala converter from CatalystTypeConverters to ExpressionEncoder.deserializer.
[jira] [Updated] (SPARK-32458) Mismatched row access sizes in tests
[ https://issues.apache.org/jira/browse/SPARK-32458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Munday updated SPARK-32458:
Description:
The RowEncoderSuite and UnsafeMapSuite tests fail on big-endian systems. This is because the test data is written into the row using unsafe operations with one size and then read back using a different size. For example, in UnsafeMapSuite the test data is written using putLong and then read back using getInt. This happens to work on little-endian systems but these differences appear to be typos and cause the tests to fail on big-endian systems.
I have a patch that fixes the issue.

was:
The RowEncoderSuite and UnsafeMapSuite tests fail on big-endian systems. This is because the test data is written into the row using unsafe operations with one size and then read back using a different size. For example, in UnsafeMapSuite the test data is written using putLong and then read back using getInt. This happens to work on little-endian systems but these differences appear to be typos and causes the tests to fail on big-endian systems.
I have a patch that fixes the issue.

> Mismatched row access sizes in tests
>
> Key: SPARK-32458
> URL: https://issues.apache.org/jira/browse/SPARK-32458
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Michael Munday
> Priority: Minor
> Labels: catalyst, endianness
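Why a putLong/getInt mismatch "happens to work" only on little-endian systems can be shown with Python's struct module (a standalone sketch, not Spark's actual unsafe row code): writing a small value as an 8-byte long and reading back the first 4 bytes as an int recovers the value on little-endian layouts, but yields zero on big-endian ones, because the significant bytes sit in the other half of the word.

```python
import struct

value = 42

# Simulate putLong (8-byte write) followed by getInt (4-byte read at the same offset).
little = struct.pack("<q", value)[:4]  # little-endian: low-order bytes come first
big = struct.pack(">q", value)[:4]     # big-endian: high-order (zero) bytes come first

read_little = struct.unpack("<i", little)[0]  # recovers 42 by accident
read_big = struct.unpack(">i", big)[0]        # reads the zero half instead
```

The little-endian read silently returns the right answer, which is exactly why the size typo survived until the tests ran on a big-endian system.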
[jira] [Created] (SPARK-32458) Mismatched row access sizes in tests
Michael Munday created SPARK-32458:

Summary: Mismatched row access sizes in tests
Key: SPARK-32458
URL: https://issues.apache.org/jira/browse/SPARK-32458
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.0.0
Reporter: Michael Munday

The RowEncoderSuite and UnsafeMapSuite tests fail on big-endian systems. This is because the test data is written into the row using unsafe operations with one size and then read back using a different size. For example, in UnsafeMapSuite the test data is written using putLong and then read back using getInt. This happens to work on little-endian systems but these differences appear to be typos and cause the tests to fail on big-endian systems.
I have a patch that fixes the issue.
[jira] [Commented] (SPARK-19169) columns changed orc table encouter 'IndexOutOfBoundsException' when read the old schema files
[ https://issues.apache.org/jira/browse/SPARK-19169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165657#comment-17165657 ]

bianqi commented on SPARK-19169:
[~hyukjin.kwon] hello, we also encountered this problem in the production environment.

{quote}
java.lang.IndexOutOfBoundsException: toIndex = 63
    at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004)
    at java.util.ArrayList.subList(ArrayList.java:996)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:202)
    at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:541)
    at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183)
    at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226)
    at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1273)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1170)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:256)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
{quote}

> columns changed orc table encouter 'IndexOutOfBoundsException' when read the old schema files
>
> Key: SPARK-19169
> URL: https://issues.apache.org/jira/browse/SPARK-19169
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2
> Reporter: roncenzhao
> Priority: Major
>
> We have an orc table called orc_test_tbl and have inserted some data into it.
> After that, we changed the table schema by dropping some columns.
> When reading the old schema file, we get the following exception.
> ```
> java.lang.IndexOutOfBoundsException: toIndex = 65
> at java.util.ArrayList.subListRangeCheck(ArrayList.java:962)
> at java.util.ArrayList.subList(ArrayList.java:954)
> at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
> at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
> at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:202)
> at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
> at
[jira] [Commented] (SPARK-32458) Mismatched row access sizes in tests
[ https://issues.apache.org/jira/browse/SPARK-32458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165604#comment-17165604 ]

Apache Spark commented on SPARK-32458:
User 'mundaym' has created a pull request for this issue:
https://github.com/apache/spark/pull/29258

> Mismatched row access sizes in tests
>
> Key: SPARK-32458
> URL: https://issues.apache.org/jira/browse/SPARK-32458
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Michael Munday
> Priority: Minor
> Labels: catalyst, endianness
>
> The RowEncoderSuite and UnsafeMapSuite tests fail on big-endian systems. This is because the test data is written into the row using unsafe operations with one size and then read back using a different size. For example, in UnsafeMapSuite the test data is written using putLong and then read back using getInt. This happens to work on little-endian systems but these differences appear to be typos and cause the tests to fail on big-endian systems.
> I have a patch that fixes the issue.
[jira] [Assigned] (SPARK-32458) Mismatched row access sizes in tests
[ https://issues.apache.org/jira/browse/SPARK-32458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-32458:
Assignee: (was: Apache Spark)

> Mismatched row access sizes in tests
>
> Key: SPARK-32458
> URL: https://issues.apache.org/jira/browse/SPARK-32458
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Michael Munday
> Priority: Minor
> Labels: catalyst, endianness
>
> The RowEncoderSuite and UnsafeMapSuite tests fail on big-endian systems. This is because the test data is written into the row using unsafe operations with one size and then read back using a different size. For example, in UnsafeMapSuite the test data is written using putLong and then read back using getInt. This happens to work on little-endian systems but these differences appear to be typos and cause the tests to fail on big-endian systems.
> I have a patch that fixes the issue.
[jira] [Assigned] (SPARK-32458) Mismatched row access sizes in tests
[ https://issues.apache.org/jira/browse/SPARK-32458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-32458:
Assignee: Apache Spark

> Mismatched row access sizes in tests
>
> Key: SPARK-32458
> URL: https://issues.apache.org/jira/browse/SPARK-32458
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Michael Munday
> Assignee: Apache Spark
> Priority: Minor
> Labels: catalyst, endianness
>
> The RowEncoderSuite and UnsafeMapSuite tests fail on big-endian systems. This is because the test data is written into the row using unsafe operations with one size and then read back using a different size. For example, in UnsafeMapSuite the test data is written using putLong and then read back using getInt. This happens to work on little-endian systems but these differences appear to be typos and cause the tests to fail on big-endian systems.
> I have a patch that fixes the issue.
[jira] [Commented] (SPARK-29918) RecordBinaryComparator should check endianness when compared by long
[ https://issues.apache.org/jira/browse/SPARK-29918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165614#comment-17165614 ]

Apache Spark commented on SPARK-29918:
User 'mundaym' has created a pull request for this issue:
https://github.com/apache/spark/pull/29259

> RecordBinaryComparator should check endianness when compared by long
>
> Key: SPARK-29918
> URL: https://issues.apache.org/jira/browse/SPARK-29918
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
> Reporter: EdisonWang
> Assignee: EdisonWang
> Priority: Minor
> Labels: correctness
> Fix For: 2.4.5, 3.0.0
>
> If the architecture supports unaligned access or the offset is 8-byte aligned, RecordBinaryComparator compares 8 bytes at a time by reading them as a long. Otherwise, it compares byte by byte.
> However, on a little-endian machine, comparing by a long value and comparing byte by byte may give different results. If the architectures in a yarn cluster differ (some are unaligned-access capable while others are not), then the order of two records after sorting is undetermined, which results in the same problem as in https://issues.apache.org/jira/browse/SPARK-23207
[jira] [Resolved] (SPARK-32435) Remove heapq3 port from Python 3
[ https://issues.apache.org/jira/browse/SPARK-32435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-32435.
Fix Version/s: 3.1.0
Resolution: Fixed

Issue resolved by pull request 29229
[https://github.com/apache/spark/pull/29229]

> Remove heapq3 port from Python 3
>
> Key: SPARK-32435
> URL: https://issues.apache.org/jira/browse/SPARK-32435
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.1.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
> Fix For: 3.1.0
>
> We have the manual port of heapq3
> https://github.com/apache/spark/blob/c3be2cd347c42972d9c499b6fd9a6f988f80af12/python/pyspark/heapq3.py
> to support {{key}} and {{reverse}}.
> It should be removed now that Python 2 support has been dropped.
[jira] [Assigned] (SPARK-32435) Remove heapq3 port from Python 3
[ https://issues.apache.org/jira/browse/SPARK-32435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-32435:
Assignee: Hyukjin Kwon

> Remove heapq3 port from Python 3
>
> Key: SPARK-32435
> URL: https://issues.apache.org/jira/browse/SPARK-32435
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.1.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
>
> We have the manual port of heapq3
> https://github.com/apache/spark/blob/c3be2cd347c42972d9c499b6fd9a6f988f80af12/python/pyspark/heapq3.py
> to support {{key}} and {{reverse}}.
> It should be removed now that Python 2 support has been dropped.
[jira] [Commented] (SPARK-32425) Spark sequence() fails if start and end of range are identical timestamps
[ https://issues.apache.org/jira/browse/SPARK-32425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165634#comment-17165634 ]

JinxinTang commented on SPARK-32425:
cc [~viirya] Because I have fixed it by https://issues.apache.org/jira/browse/SPARK-31980# : )

> Spark sequence() fails if start and end of range are identical timestamps
>
> Key: SPARK-32425
> URL: https://issues.apache.org/jira/browse/SPARK-32425
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Environment: Spark 3.0 on Databricks Runtime 7.1
> Reporter: Lauri Koobas
> Priority: Minor
>
> The following Spark SQL query throws an exception:
> {code:java}
> select sequence(current_timestamp, current_timestamp)
> {code}
> The error is:
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException: 1
>     at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:78)
> {noformat}
> Essentially a copy of https://issues.apache.org/jira/browse/SPARK-31980#
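For reference, the expected semantics are that an inclusive sequence whose start equals its end yields a single-element result rather than failing. A minimal sketch of that contract in plain Python (the `sequence` helper here is hypothetical and not Spark's implementation):

```python
from datetime import datetime, timedelta

def sequence(start, stop, step=timedelta(days=1)):
    """Inclusive range of timestamps; start == stop must yield one element."""
    out = []
    cur = start
    while cur <= stop:
        out.append(cur)
        cur += step
    return out

ts = datetime(2020, 7, 27)
single = sequence(ts, ts)  # the degenerate case that triggered the exception
```

The degenerate `start == stop` call returns `[ts]`, which is the behavior the bug report says `sequence(current_timestamp, current_timestamp)` should have.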
[jira] [Commented] (SPARK-27194) Job failures when task attempts do not clean up spark-staging parquet files
[ https://issues.apache.org/jira/browse/SPARK-27194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165649#comment-17165649 ]

Apache Spark commented on SPARK-27194:
User 'WinkerDu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29260

> Job failures when task attempts do not clean up spark-staging parquet files
>
> Key: SPARK-27194
> URL: https://issues.apache.org/jira/browse/SPARK-27194
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 2.3.1, 2.3.2, 2.3.3
> Reporter: Reza Safi
> Priority: Major
>
> When a container fails for some reason (for example when killed by yarn for exceeding memory limits), the subsequent task attempts for the tasks that were running on that container all fail with a FileAlreadyExistsException. The original task attempt does not seem to successfully call abortTask (or at least its "best effort" delete is unsuccessful) and clean up the parquet file it was writing to, so when later task attempts try to write to the same spark-staging directory using the same file name, the job fails.
> Here is what transpires in the logs:
> The container where task 200.0 is running is killed and the task is lost:
> {code}
> 19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> 19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage 0.0 (TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> {code}
> The task is re-attempted on a different executor and fails because the part-00200-blah-blah.c000.snappy.parquet file from the first task attempt already exists:
> {code}
> 19/02/20 09:35:01 WARN scheduler.TaskSetManager: Lost task 200.1 in stage 0.0 (TID 594, tn.y.z.com, executor 70): org.apache.spark.SparkException: Task failed while writing rows.
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>     at org.apache.spark.scheduler.Task.run(Task.scala:109)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet for client a.b.c.d already exists
> {code}
> The job fails when the configured task attempts (spark.task.maxFailures) have failed with the same error:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 200 in stage 0.0 failed 20 times, most recent failure: Lost task 284.19 in stage 0.0 (TID yyy, tm.y.z.com, executor 16): org.apache.spark.SparkException: Task failed while writing rows.
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
>     ...
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet for client i.p.a.d already exists
> {code}
> SPARK-26682 wasn't the root cause here, since there wasn't any stage reattempt.
> This issue seems to happen when spark.sql.sources.partitionOverwriteMode=dynamic.
[jira] [Commented] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165651#comment-17165651 ]

Apache Spark commented on SPARK-29302:
User 'WinkerDu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29260

> dynamic partition overwrite with speculation enabled
>
> Key: SPARK-29302
> URL: https://issues.apache.org/jira/browse/SPARK-29302
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.4
> Reporter: feiwang
> Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
> Now, for a dynamic partition overwrite operation, the filename of a task output is deterministic.
> So, if speculation is enabled, could a task conflict with its corresponding speculative task?
> Could the two tasks concurrently write to the same file?
[jira] [Commented] (SPARK-20680) Spark-sql do not support for void column datatype of view
[ https://issues.apache.org/jira/browse/SPARK-20680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165669#comment-17165669 ]

Apache Spark commented on SPARK-20680:
User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/29244

> Spark-sql do not support for void column datatype of view
>
> Key: SPARK-20680
> URL: https://issues.apache.org/jira/browse/SPARK-20680
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0, 2.1.1, 2.4.6, 3.0.0
> Reporter: Lantao Jin
> Assignee: Lantao Jin
> Priority: Major
> Fix For: 3.1.0
>
> Create a HIVE view:
> {quote}
> hive> create table bad as select 1 x, null z from dual;
> {quote}
> Because there's no type, Hive gives it the VOID type:
> {quote}
> hive> describe bad;
> OK
> x int
> z void
> {quote}
> In Spark 2.0.x, the behaviour when reading this view is normal:
> {quote}
> spark-sql> describe bad;
> x int NULL
> z void NULL
> Time taken: 4.431 seconds, Fetched 2 row(s)
> {quote}
> But in Spark 2.1.x, it fails with SparkException: Cannot recognize hive type string: void
> {quote}
> spark-sql> describe bad;
> 17/05/09 03:12:08 INFO execution.SparkSqlParser: Parsing command: describe bad
> 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: int
> 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: void
> 17/05/09 03:12:08 ERROR thriftserver.SparkSQLDriver: Failed in [describe bad]
> org.apache.spark.SparkException: Cannot recognize hive type string: void
>     at org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789)
>     at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
>     at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>     at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>     at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>     at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>     at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>     at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365)
>     at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361)
> Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
> DataType void() is not supported.(line 1, pos 0)
> == SQL ==
> void
> ^^^
> ... 61 more
> org.apache.spark.SparkException: Cannot recognize hive type string: void
> {quote}
[jira] [Commented] (SPARK-31851) Redesign PySpark documentation
[ https://issues.apache.org/jira/browse/SPARK-31851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165546#comment-17165546 ] Hyukjin Kwon commented on SPARK-31851: -- The base work is done. I will create one more PR soon to show an example of the documentation so that people can easily follow. > Redesign PySpark documentation > -- > > Key: SPARK-31851 > URL: https://issues.apache.org/jira/browse/SPARK-31851 > Project: Spark > Issue Type: Umbrella > Components: ML, PySpark, Spark Core, SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > > Currently, PySpark documentation > (https://spark.apache.org/docs/latest/api/python/index.html) is rather > poorly written compared to other projects. > See, for example, Koalas https://koalas.readthedocs.io/en/latest/. > PySpark is becoming more and more important in Spark, and we should improve this > documentation so people can easily follow it. > Reference: > - https://koalas.readthedocs.io/en/latest/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32455) LogisticRegressionModel prediction optimization
[ https://issues.apache.org/jira/browse/SPARK-32455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165552#comment-17165552 ] Apache Spark commented on SPARK-32455: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/29255 > LogisticRegressionModel prediction optimization > --- > > Key: SPARK-32455 > URL: https://issues.apache.org/jira/browse/SPARK-32455 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Minor > > if needed, method getThreshold and/or following logic to compute rawThreshold > is called on each instance. > > {code:java} > override def getThreshold: Double = { > checkThresholdConsistency() > if (isSet(thresholds)) { > val ts = $(thresholds) > require(ts.length == 2, "Logistic Regression getThreshold only applies > to" + > " binary classification, but thresholds has length != 2. thresholds: " > + ts.mkString(",")) > 1.0 / (1.0 + ts(0) / ts(1)) > } else { > $(threshold) > } > } {code} > > {code:java} > val rawThreshold = if (t == 0.0) { > Double.NegativeInfinity > } else if (t == 1.0) { > Double.PositiveInfinity > } else { > math.log(t / (1.0 - t)) > } {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32455) LogisticRegressionModel prediction optimization
[ https://issues.apache.org/jira/browse/SPARK-32455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32455: Assignee: (was: Apache Spark) > LogisticRegressionModel prediction optimization > --- > > Key: SPARK-32455 > URL: https://issues.apache.org/jira/browse/SPARK-32455 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Minor > > if needed, method getThreshold and/or following logic to compute rawThreshold > is called on each instance. > > {code:java} > override def getThreshold: Double = { > checkThresholdConsistency() > if (isSet(thresholds)) { > val ts = $(thresholds) > require(ts.length == 2, "Logistic Regression getThreshold only applies > to" + > " binary classification, but thresholds has length != 2. thresholds: " > + ts.mkString(",")) > 1.0 / (1.0 + ts(0) / ts(1)) > } else { > $(threshold) > } > } {code} > > {code:java} > val rawThreshold = if (t == 0.0) { > Double.NegativeInfinity > } else if (t == 1.0) { > Double.PositiveInfinity > } else { > math.log(t / (1.0 - t)) > } {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32455) LogisticRegressionModel prediction optimization
[ https://issues.apache.org/jira/browse/SPARK-32455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32455: Assignee: Apache Spark > LogisticRegressionModel prediction optimization > --- > > Key: SPARK-32455 > URL: https://issues.apache.org/jira/browse/SPARK-32455 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Minor > > if needed, method getThreshold and/or following logic to compute rawThreshold > is called on each instance. > > {code:java} > override def getThreshold: Double = { > checkThresholdConsistency() > if (isSet(thresholds)) { > val ts = $(thresholds) > require(ts.length == 2, "Logistic Regression getThreshold only applies > to" + > " binary classification, but thresholds has length != 2. thresholds: " > + ts.mkString(",")) > 1.0 / (1.0 + ts(0) / ts(1)) > } else { > $(threshold) > } > } {code} > > {code:java} > val rawThreshold = if (t == 0.0) { > Double.NegativeInfinity > } else if (t == 1.0) { > Double.PositiveInfinity > } else { > math.log(t / (1.0 - t)) > } {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32456) Give better error message for union streams in append mode that don't have a watermark
Yuanjian Li created SPARK-32456: --- Summary: Give better error message for union streams in append mode that don't have a watermark Key: SPARK-32456 URL: https://issues.apache.org/jira/browse/SPARK-32456 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Yuanjian Li Check the following example: {code:java} val s1 = spark.readStream.format("rate").option("rowsPerSecond", 1).load().createOrReplaceTempView("s1") val s2 = spark.readStream.format("rate").option("rowsPerSecond", 1).load().createOrReplaceTempView("s2") val unionRes = spark.sql("select s1.value, s1.timestamp from s1 union select s2.value, s2.timestamp from s2") unionRes.writeStream.option("checkpointLocation", ${pathA}).start(${pathB}) {code} We'll get the following confusing exception: {code:java} java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at org.apache.spark.sql.execution.streaming.StateStoreSaveExec.$anonfun$doExecute$9(statefulOperators.scala:346) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:561) at org.apache.spark.sql.execution.streaming.StateStoreWriter.timeTakenMs(statefulOperators.scala:112) ... {code} The union clause in SQL requires deduplication, so the parser generates {{Distinct(Union)}} and the optimizer rule {{ReplaceDistinctWithAggregate}} changes it to {{Aggregate(Union)}}. So the root cause here is that the checking logic that exists for Aggregate is missing for Distinct. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
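The root-cause description above can be illustrated with a toy plan rewrite in Python. The node classes below are hypothetical stand-ins for Catalyst's logical-plan nodes, not Spark's actual API; the point is only that a check which pattern-matches on Aggregate never fires on the equivalent pre-optimization Distinct form:

```python
# Toy logical-plan nodes: the parser produces Distinct(Union), the
# optimizer rule rewrites it to Aggregate(Union), but a check that only
# matches Aggregate misses the Distinct form and the error surfaces later
# as a confusing None.get.
from dataclasses import dataclass

@dataclass
class Union:
    pass

@dataclass
class Distinct:
    child: object

@dataclass
class Aggregate:
    child: object

def replace_distinct_with_aggregate(plan):
    # Analogue of the optimizer rule ReplaceDistinctWithAggregate.
    if isinstance(plan, Distinct):
        return Aggregate(plan.child)
    return plan

def triggers_watermark_check(plan):
    # A validity check that only handles Aggregate, as described above.
    return isinstance(plan, Aggregate)

parsed = Distinct(Union())
optimized = replace_distinct_with_aggregate(parsed)
```

Here `triggers_watermark_check(parsed)` is False while `triggers_watermark_check(optimized)` is True, which mirrors why the missing-watermark condition is not reported with a clear message for the Distinct form.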
[jira] [Commented] (SPARK-32417) Flaky test: BlockManagerDecommissionIntegrationSuite.verify that an already running task which is going to cache data succeeds on a decommissioned executor
[ https://issues.apache.org/jira/browse/SPARK-32417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165466#comment-17165466 ] Apache Spark commented on SPARK-32417: -- User 'agrawaldevesh' has created a pull request for this issue: https://github.com/apache/spark/pull/29226 > Flaky test: BlockManagerDecommissionIntegrationSuite.verify that an already > running task which is going to cache data succeeds on a decommissioned > executor > --- > > Key: SPARK-32417 > URL: https://issues.apache.org/jira/browse/SPARK-32417 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126424/testReport/ > {code:java} > Error Message > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 2759 times over 30.001772248 > seconds. Last failure message: Map() was empty We should have a block that > has been on multiple BMs in rdds: > ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, > localhost, 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 > replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, > 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0))) > from: > ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, > amp-jenkins-worker-05.amp, 45854, > None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(2, localhost, > 42805, 
None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, > 42041, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, > 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0))) > but instead we got: Map(rdd_1_0 -> 1, rdd_1_2 -> 1, rdd_1_1 -> 1). > Stacktrace > sbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 2759 times over 30.001772248 > seconds. Last failure message: Map() was empty We should have a block that > has been on multiple BMs in rdds: > ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, > localhost, 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 > replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, > 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0))) > from: > ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, > amp-jenkins-worker-05.amp, 45854, > None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(2, localhost, > 42805, None),broadcast_1_piece0,StorageLevel(memory, 1 
replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, > 42041, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, > 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0))) > but instead we got: > Map(rdd_1_0 -> 1, rdd_1_2 -> 1,
[jira] [Assigned] (SPARK-32417) Flaky test: BlockManagerDecommissionIntegrationSuite.verify that an already running task which is going to cache data succeeds on a decommissioned executor
[ https://issues.apache.org/jira/browse/SPARK-32417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32417: Assignee: Apache Spark > Flaky test: BlockManagerDecommissionIntegrationSuite.verify that an already > running task which is going to cache data succeeds on a decommissioned > executor > --- > > Key: SPARK-32417 > URL: https://issues.apache.org/jira/browse/SPARK-32417 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Assignee: Apache Spark >Priority: Major > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126424/testReport/ > {code:java} > Error Message > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 2759 times over 30.001772248 > seconds. Last failure message: Map() was empty We should have a block that > has been on multiple BMs in rdds: > ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, > localhost, 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 > replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, > 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0))) > from: > ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, > amp-jenkins-worker-05.amp, 45854, > None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(2, localhost, > 42805, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > 
SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, > 42041, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, > 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0))) > but instead we got: Map(rdd_1_0 -> 1, rdd_1_2 -> 1, rdd_1_1 -> 1). > Stacktrace > sbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 2759 times over 30.001772248 > seconds. Last failure message: Map() was empty We should have a block that > has been on multiple BMs in rdds: > ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, > localhost, 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 > replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, > 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0))) > from: > ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, > amp-jenkins-worker-05.amp, 45854, > None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(2, localhost, > 42805, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > 
SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, > 42041, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, > 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0))) > but instead we got: > Map(rdd_1_0 -> 1, rdd_1_2 -> 1, rdd_1_1 -> 1). > at >
[jira] [Issue Comment Deleted] (SPARK-24194) HadoopFsRelation cannot overwrite a path that is also being read from
[ https://issues.apache.org/jira/browse/SPARK-24194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] philipse updated SPARK-24194: - Comment: was deleted (was: Hi is the issue closed ? can i try it in product env? Thanks) > HadoopFsRelation cannot overwrite a path that is also being read from > - > > Key: SPARK-24194 > URL: https://issues.apache.org/jira/browse/SPARK-24194 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 > Environment: spark master >Reporter: yangz >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > When > {code:java} > INSERT OVERWRITE TABLE territory_count_compare select * from > territory_count_compare where shop_count!=real_shop_count > {code} > is run and territory_count_compare is a Parquet table, there will be an error: > Cannot overwrite a path that is also being read from > > And in the file MetastoreDataSourceSuite.scala, there is a test case: > > > {code:java} > table(tableName).write.mode(SaveMode.Overwrite).insertInto(tableName) > {code} > > But when the table territory_count_compare is a plain Hive table, there is > no error. > So I think the reason is that when an insert overwrite into a HadoopFsRelation > uses a static partition, it first deletes the partition in the output, but > that deletion should instead happen when the job is committed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32434) Support Scala 2.13 in AbstractCommandBuilder and load-spark-env scripts
[ https://issues.apache.org/jira/browse/SPARK-32434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165535#comment-17165535 ] Apache Spark commented on SPARK-32434: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/29254 > Support Scala 2.13 in AbstractCommandBuilder and load-spark-env scripts > --- > > Key: SPARK-32434 > URL: https://issues.apache.org/jira/browse/SPARK-32434 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32453) Remove SPARK_SCALA_VERSION environment and let load-spark-env scripts detect it in AppVeyor
[ https://issues.apache.org/jira/browse/SPARK-32453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165549#comment-17165549 ] Hyukjin Kwon edited comment on SPARK-32453 at 7/27/20, 8:57 AM: It will be superseded by https://github.com/apache/spark/pull/29254. was (Author: hyukjin.kwon): It will be fixed together at https://github.com/apache/spark/pull/29254. > Remove SPARK_SCALA_VERSION environment and let load-spark-env scripts detect > it in AppVeyor > --- > > Key: SPARK-32453 > URL: https://issues.apache.org/jira/browse/SPARK-32453 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > .\bin\spark-submit2.cmd --driver-java-options > "-Dlog4j.configuration=file:///%CD:\=/%/R/log4j.properties" --conf > spark.hadoop.fs.defaultFS="file:///" R\pkg\tests\run-all.R > "Presence of build for multiple Scala versions detected ( and )." > "Remove one of them or, set SPARK_SCALA_VERSION= in \spark-env.cmd." > "Visit for more details about setting environment variables in > spark-env.cmd." > "Either clean one of them or, set SPARK_SCALA_VERSION in spark-env.cmd." > Command exited with code 1 > {code} > The load-spark-env script fails as below in AppVeyor. It was temporarily > explicitly set but we should remove and detect it automatically in this > script. > Possibly related to SPARK-26132, SPARK-32227 and SPARK-32434 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
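The detection behavior discussed in SPARK-32453 — erroring out when builds for multiple Scala versions are present unless SPARK_SCALA_VERSION is set — can be sketched as follows. This is an illustrative Python rendering under assumed conventions (assembly output under `assembly/target/scala-<version>`), not the actual load-spark-env shell/cmd script:

```python
import os

def detect_scala_version(spark_home, env):
    # Sketch of the load-spark-env logic: honor an explicit override,
    # otherwise require exactly one scala-<version> assembly directory.
    if "SPARK_SCALA_VERSION" in env:
        return env["SPARK_SCALA_VERSION"]
    base = os.path.join(spark_home, "assembly", "target")
    builds = [d[len("scala-"):] for d in sorted(os.listdir(base))
              if d.startswith("scala-")]
    if len(builds) != 1:
        raise RuntimeError(
            "Presence of build for multiple Scala versions detected "
            f"({', '.join(builds)}). Remove one of them, or set "
            "SPARK_SCALA_VERSION in spark-env.")
    return builds[0]
```

The AppVeyor failure quoted above shows the multiple-versions branch firing with empty placeholders ("( and )"), which is what motivated fixing the detection rather than pinning the variable.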
[jira] [Created] (SPARK-32455) LogisticRegressionModel prediction optimization
zhengruifeng created SPARK-32455: Summary: LogisticRegressionModel prediction optimization Key: SPARK-32455 URL: https://issues.apache.org/jira/browse/SPARK-32455 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.1.0 Reporter: zhengruifeng If needed, the method getThreshold and/or the following logic to compute rawThreshold is called for each instance, even though the result is constant for a given model; it should be computed once instead. {code:java} override def getThreshold: Double = { checkThresholdConsistency() if (isSet(thresholds)) { val ts = $(thresholds) require(ts.length == 2, "Logistic Regression getThreshold only applies to" + " binary classification, but thresholds has length != 2. thresholds: " + ts.mkString(",")) 1.0 / (1.0 + ts(0) / ts(1)) } else { $(threshold) } } {code} {code:java} val rawThreshold = if (t == 0.0) { Double.NegativeInfinity } else if (t == 1.0) { Double.PositiveInfinity } else { math.log(t / (1.0 - t)) } {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
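The two Scala snippets above compute a probability threshold t from the thresholds pair and then its logit in raw-margin space. The math can be checked with a small Python transliteration (a sketch of the quoted logic, not Spark's code):

```python
import math

def get_threshold(thresholds=None, default_threshold=0.5):
    # Transliteration of getThreshold: for binary classification,
    # t = 1 / (1 + thresholds(0) / thresholds(1)).
    if thresholds is not None:
        t0, t1 = thresholds
        return 1.0 / (1.0 + t0 / t1)
    return default_threshold

def raw_threshold(t):
    # Threshold in raw-margin space: the logit of t, with the endpoints
    # mapped to -inf / +inf so that predictions become all-1 / all-0.
    if t == 0.0:
        return float("-inf")
    if t == 1.0:
        return float("inf")
    return math.log(t / (1.0 - t))
```

Both functions are pure in the model parameters, which is the point of the optimization: compute `raw_threshold(get_threshold(...))` once per model rather than once per predicted row.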
[jira] [Resolved] (SPARK-32453) Remove SPARK_SCALA_VERSION environment and let load-spark-env scripts detect it in AppVeyor
[ https://issues.apache.org/jira/browse/SPARK-32453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32453. -- Resolution: Invalid > Remove SPARK_SCALA_VERSION environment and let load-spark-env scripts detect > it in AppVeyor > --- > > Key: SPARK-32453 > URL: https://issues.apache.org/jira/browse/SPARK-32453 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > .\bin\spark-submit2.cmd --driver-java-options > "-Dlog4j.configuration=file:///%CD:\=/%/R/log4j.properties" --conf > spark.hadoop.fs.defaultFS="file:///" R\pkg\tests\run-all.R > "Presence of build for multiple Scala versions detected ( and )." > "Remove one of them or, set SPARK_SCALA_VERSION= in \spark-env.cmd." > "Visit for more details about setting environment variables in > spark-env.cmd." > "Either clean one of them or, set SPARK_SCALA_VERSION in spark-env.cmd." > Command exited with code 1 > {code} > The load-spark-env script fails as below in AppVeyor. It was temporarily > explicitly set but we should remove and detect it automatically in this > script. > Possibly related to SPARK-26132, SPARK-32227 and SPARK-32434 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32453) Remove SPARK_SCALA_VERSION environment and let load-spark-env scripts detect it in AppVeyor
[ https://issues.apache.org/jira/browse/SPARK-32453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165549#comment-17165549 ] Hyukjin Kwon commented on SPARK-32453: -- It will be fixed together at https://github.com/apache/spark/pull/29254. > Remove SPARK_SCALA_VERSION environment and let load-spark-env scripts detect > it in AppVeyor > --- > > Key: SPARK-32453 > URL: https://issues.apache.org/jira/browse/SPARK-32453 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > {code} > .\bin\spark-submit2.cmd --driver-java-options > "-Dlog4j.configuration=file:///%CD:\=/%/R/log4j.properties" --conf > spark.hadoop.fs.defaultFS="file:///" R\pkg\tests\run-all.R > "Presence of build for multiple Scala versions detected ( and )." > "Remove one of them or, set SPARK_SCALA_VERSION= in \spark-env.cmd." > "Visit for more details about setting environment variables in > spark-env.cmd." > "Either clean one of them or, set SPARK_SCALA_VERSION in spark-env.cmd." > Command exited with code 1 > {code} > The load-spark-env script fails as below in AppVeyor. It was temporarily > explicitly set but we should remove and detect it automatically in this > script. > Possibly related to SPARK-26132, SPARK-32227 and SPARK-32434 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32456) Give better error message for union streams in append mode that don't have a watermark
[ https://issues.apache.org/jira/browse/SPARK-32456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32456: Assignee: Apache Spark > Give better error message for union streams in append mode that don't have a > watermark > -- > > Key: SPARK-32456 > URL: https://issues.apache.org/jira/browse/SPARK-32456 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Assignee: Apache Spark >Priority: Major > > Check the following example: > > {code:java} > val s1 = spark.readStream.format("rate").option("rowsPerSecond", > 1).load().createOrReplaceTempView("s1") > val s2 = spark.readStream.format("rate").option("rowsPerSecond", > 1).load().createOrReplaceTempView("s2") > val unionRes = spark.sql("select s1.value , s1.timestamp from s1 union select > s2.value, s2.timestamp from s2") > unionResult.writeStream.option("checkpointLocation", ${pathA}).start(${pathB}) > {code} > > We'll get the following confusing exception: > {code:java} > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:529) > at scala.None$.get(Option.scala:527) > at > org.apache.spark.sql.execution.streaming.StateStoreSaveExec.$anonfun$doExecute$9(statefulOperators.scala:346) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:561) > at > org.apache.spark.sql.execution.streaming.StateStoreWriter.timeTakenMs(statefulOperators.scala:112) > ... > {code} > The union clause in SQL has the requirement of deduplication, the parser will > generate {{Distinct(Union)}} and the optimizer rule > {{ReplaceDistinctWithAggregate}} will change it to {{Aggregate(Union)}}. So > the root cause here is the checking logic for Aggregate is missing for > Distinct. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32456) Give better error message for union streams in append mode that don't have a watermark
[ https://issues.apache.org/jira/browse/SPARK-32456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32456: Assignee: (was: Apache Spark) > Give better error message for union streams in append mode that don't have a > watermark > -- > > Key: SPARK-32456 > URL: https://issues.apache.org/jira/browse/SPARK-32456 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Priority: Major > > Check the following example: > > {code:java} > val s1 = spark.readStream.format("rate").option("rowsPerSecond", > 1).load().createOrReplaceTempView("s1") > val s2 = spark.readStream.format("rate").option("rowsPerSecond", > 1).load().createOrReplaceTempView("s2") > val unionRes = spark.sql("select s1.value , s1.timestamp from s1 union select > s2.value, s2.timestamp from s2") > unionResult.writeStream.option("checkpointLocation", ${pathA}).start(${pathB}) > {code} > > We'll get the following confusing exception: > {code:java} > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:529) > at scala.None$.get(Option.scala:527) > at > org.apache.spark.sql.execution.streaming.StateStoreSaveExec.$anonfun$doExecute$9(statefulOperators.scala:346) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:561) > at > org.apache.spark.sql.execution.streaming.StateStoreWriter.timeTakenMs(statefulOperators.scala:112) > ... > {code} > The union clause in SQL has the requirement of deduplication, the parser will > generate {{Distinct(Union)}} and the optimizer rule > {{ReplaceDistinctWithAggregate}} will change it to {{Aggregate(Union)}}. So > the root cause here is the checking logic for Aggregate is missing for > Distinct. 
[jira] [Commented] (SPARK-32456) Give better error message for union streams in append mode that don't have a watermark
[ https://issues.apache.org/jira/browse/SPARK-32456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165567#comment-17165567 ] Apache Spark commented on SPARK-32456: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/29256 > Give better error message for union streams in append mode that don't have a > watermark > -- > > Key: SPARK-32456 > URL: https://issues.apache.org/jira/browse/SPARK-32456 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Priority: Major > > Check the following example: > > {code:java} > val s1 = spark.readStream.format("rate").option("rowsPerSecond", > 1).load().createOrReplaceTempView("s1") > val s2 = spark.readStream.format("rate").option("rowsPerSecond", > 1).load().createOrReplaceTempView("s2") > val unionRes = spark.sql("select s1.value, s1.timestamp from s1 union select > s2.value, s2.timestamp from s2") > unionRes.writeStream.option("checkpointLocation", ${pathA}).start(${pathB}) > {code} > > We'll get the following confusing exception: > {code:java} > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:529) > at scala.None$.get(Option.scala:527) > at > org.apache.spark.sql.execution.streaming.StateStoreSaveExec.$anonfun$doExecute$9(statefulOperators.scala:346) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:561) > at > org.apache.spark.sql.execution.streaming.StateStoreWriter.timeTakenMs(statefulOperators.scala:112) > ... > {code} > The union clause in SQL requires deduplication, so the parser will > generate {{Distinct(Union)}} and the optimizer rule > {{ReplaceDistinctWithAggregate}} will change it to {{Aggregate(Union)}}. So > the root cause here is that the checking logic that exists for Aggregate is > missing for Distinct. 
[jira] [Created] (SPARK-32457) logParam thresholds in DT/GBT/FM/LR/MLP
zhengruifeng created SPARK-32457: Summary: logParam thresholds in DT/GBT/FM/LR/MLP Key: SPARK-32457 URL: https://issues.apache.org/jira/browse/SPARK-32457 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.1.0 Reporter: zhengruifeng The param thresholds is logged in NB/RF, but not in the other ProbabilisticClassifiers -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32417) Flaky test: BlockManagerDecommissionIntegrationSuite.verify that an already running task which is going to cache data succeeds on a decommissioned executor
[ https://issues.apache.org/jira/browse/SPARK-32417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32417: Assignee: (was: Apache Spark) > Flaky test: BlockManagerDecommissionIntegrationSuite.verify that an already > running task which is going to cache data succeeds on a decommissioned > executor > --- > > Key: SPARK-32417 > URL: https://issues.apache.org/jira/browse/SPARK-32417 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126424/testReport/ > {code:java} > Error Message > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 2759 times over 30.001772248 > seconds. Last failure message: Map() was empty We should have a block that > has been on multiple BMs in rdds: > ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, > localhost, 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 > replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, > 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0))) > from: > ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, > amp-jenkins-worker-05.amp, 45854, > None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(2, localhost, > 42805, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, > 
42041, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, > 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0))) > but instead we got: Map(rdd_1_0 -> 1, rdd_1_2 -> 1, rdd_1_1 -> 1). > Stacktrace > sbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 2759 times over 30.001772248 > seconds. Last failure message: Map() was empty We should have a block that > has been on multiple BMs in rdds: > ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, > localhost, 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 > replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, > 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0))) > from: > ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, > amp-jenkins-worker-05.amp, 45854, > None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(2, localhost, > 42805, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, > 42041, 
None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, > 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0))) > but instead we got: > Map(rdd_1_0 -> 1, rdd_1_2 -> 1, rdd_1_1 -> 1). > at > org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:432) >
[jira] [Commented] (SPARK-31587) R installation in Github Actions is being failed
[ https://issues.apache.org/jira/browse/SPARK-31587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165511#comment-17165511 ] Apache Spark commented on SPARK-31587: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/28380 > R installation in Github Actions is being failed > > > Key: SPARK-31587 > URL: https://issues.apache.org/jira/browse/SPARK-31587 > Project: Spark > Issue Type: Bug > Components: Project Infra, R >Affects Versions: 2.4.5, 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > Currently, R installation seems being failed as below: > {code} > Get:61 https://dl.bintray.com/sbt/debian Packages [4174 B] > Get:62 http://ppa.launchpad.net/apt-fast/stable/ubuntu bionic/main amd64 > Packages [532 B] > Get:63 http://ppa.launchpad.net/git-core/ppa/ubuntu bionic/main amd64 > Packages [3036 B] > Get:64 http://ppa.launchpad.net/ondrej/php/ubuntu bionic/main amd64 Packages > [52.0 kB] > Get:65 http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu bionic/main > amd64 Packages [33.9 kB] > Get:66 http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu bionic/main > Translation-en [10.1 kB] > Reading package lists... > E: The repository 'https://cloud.r-project.org/bin/linux/ubuntu > bionic-cran35/ Release' does not have a Release file. > ##[error]Process completed with exit code 100. > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29664) Column.getItem behavior is not consistent with Scala version
[ https://issues.apache.org/jira/browse/SPARK-29664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165510#comment-17165510 ] Apache Spark commented on SPARK-29664: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/28327 > Column.getItem behavior is not consistent with Scala version > > > Key: SPARK-29664 > URL: https://issues.apache.org/jira/browse/SPARK-29664 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > In PySpark, Column.getItem's behavior is different from the Scala version. > For example, > In PySpark: > {code:python} > df = spark.range(2) > map_col = create_map(lit(0), lit(100), lit(1), lit(200)) > df.withColumn("mapped", map_col.getItem(col('id'))).show() > # +---+--+ > # | id|mapped| > # +---+--+ > # | 0| 100| > # | 1| 200| > # +---+--+ > {code} > In Scala: > {code:scala} > val df = spark.range(2) > val map_col = map(lit(0), lit(100), lit(1), lit(200)) > // The following getItem results in the following exception, which is the > right behavior: > // java.lang.RuntimeException: Unsupported literal type class > org.apache.spark.sql.Column id > // at > org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78) > // at org.apache.spark.sql.Column.getItem(Column.scala:856) > // ... 49 elided > df.withColumn("mapped", map_col.getItem(col("id"))).show > // You have to use apply() to match with PySpark's behavior. > df.withColumn("mapped", map_col(col("id"))).show > // +---+--+ > // | id|mapped| > // +---+--+ > // | 0| 100| > // | 1| 200| > // +---+--+ > {code} > Looking at the code for Scala implementation, PySpark's behavior is incorrect > since the argument to getItem becomes `Literal`. 
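The core of the inconsistency is whether the getItem argument is coerced to a literal key or evaluated against each row. A hedged plain-Python model of the two behaviors (the helper names are hypothetical, not Spark APIs):

```python
# Model a map column as a dict and a row-dependent key as a function of the row.
mapping = {0: 100, 1: 200}

def get_item_literal(m, key):
    # Scala's Column.getItem forces the argument into a Literal, so a
    # row-dependent key (a column expression) is rejected up front.
    if callable(key):
        raise TypeError("unsupported literal type: column expression")
    return m[key]

def get_item_apply(m, key, row):
    # Column.apply (and PySpark's getItem today) lets the key be an
    # expression evaluated per row.
    k = key(row) if callable(key) else key
    return m[k]

rows = [{"id": 0}, {"id": 1}]
print([get_item_apply(mapping, lambda r: r["id"], r) for r in rows])  # [100, 200]
```

Under this model, PySpark's getItem behaves like `get_item_apply` while the Scala version behaves like `get_item_literal`, which is the mismatch the ticket describes.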
[jira] [Resolved] (SPARK-32188) API Reference
[ https://issues.apache.org/jira/browse/SPARK-32188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32188. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29188 [https://github.com/apache/spark/pull/29188] > API Reference > - > > Key: SPARK-32188 > URL: https://issues.apache.org/jira/browse/SPARK-32188 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > > Example: https://hyukjin-spark.readthedocs.io/en/latest/reference/index.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32179) Replace and redesign the documentation base
[ https://issues.apache.org/jira/browse/SPARK-32179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32179. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29188 [https://github.com/apache/spark/pull/29188] > Replace and redesign the documentation base > --- > > Key: SPARK-32179 > URL: https://issues.apache.org/jira/browse/SPARK-32179 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > > I have a demo site for this task. See > https://hyukjin-spark.readthedocs.io/en/latest/ as an example. > The base work should be fine first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32457) logParam thresholds in DT/GBT/FM/LR/MLP
[ https://issues.apache.org/jira/browse/SPARK-32457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32457: Assignee: (was: Apache Spark) > logParam thresholds in DT/GBT/FM/LR/MLP > > > Key: SPARK-32457 > URL: https://issues.apache.org/jira/browse/SPARK-32457 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Trivial > > param thresholds is logged in NB/RF, but not in other ProbabilisticClassifier -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32457) logParam thresholds in DT/GBT/FM/LR/MLP
[ https://issues.apache.org/jira/browse/SPARK-32457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165573#comment-17165573 ] Apache Spark commented on SPARK-32457: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/29257 > logParam thresholds in DT/GBT/FM/LR/MLP > > > Key: SPARK-32457 > URL: https://issues.apache.org/jira/browse/SPARK-32457 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Trivial > > param thresholds is logged in NB/RF, but not in other ProbabilisticClassifier -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32457) logParam thresholds in DT/GBT/FM/LR/MLP
[ https://issues.apache.org/jira/browse/SPARK-32457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165572#comment-17165572 ] Apache Spark commented on SPARK-32457: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/29257 > logParam thresholds in DT/GBT/FM/LR/MLP > > > Key: SPARK-32457 > URL: https://issues.apache.org/jira/browse/SPARK-32457 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Trivial > > param thresholds is logged in NB/RF, but not in other ProbabilisticClassifier -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32425) Spark sequence() fails if start and end of range are identical timestamps
[ https://issues.apache.org/jira/browse/SPARK-32425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-32425. - Resolution: Duplicate > Spark sequence() fails if start and end of range are identical timestamps > - > > Key: SPARK-32425 > URL: https://issues.apache.org/jira/browse/SPARK-32425 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: Spark 3.0 on Databricks Runtime 7.1 >Reporter: Lauri Koobas >Priority: Minor > > The following Spark SQL query throws an exception > {code:java} > select sequence(current_timestamp, current_timestamp) > {code} > The error is: > {noformat} > java.lang.ArrayIndexOutOfBoundsException: 1 > at > scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:78){noformat} > Essentially a copy of https://issues.apache.org/jira/browse/SPARK-31980# -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
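The boundary condition is easy to state: a correct sequence must allocate (end - start) / step + 1 elements, which is 1, not 0, when start equals end. A plain-Python sketch of the intended semantics (an illustration only, not Spark's implementation):

```python
def sequence(start, end, step=1):
    # Size the result up front, the way a fixed-size array would be
    # allocated; start == end yields a single-element result instead of
    # an out-of-bounds write.
    if step == 0:
        raise ValueError("step must be non-zero")
    n = (end - start) // step + 1
    if n <= 0:
        return []
    return [start + i * step for i in range(n)]

print(sequence(5, 5))       # [5]  -- the degenerate case from the report
print(sequence(1, 4))       # [1, 2, 3, 4]
print(sequence(4, 1, -1))   # [4, 3, 2, 1]
```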
[jira] [Commented] (SPARK-32459) UDF regression of WrappedArray supporting caused by SPARK-31826
[ https://issues.apache.org/jira/browse/SPARK-32459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165698#comment-17165698 ] Apache Spark commented on SPARK-32459: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/29261 > UDF regression of WrappedArray supporting caused by SPARK-31826 > --- > > Key: SPARK-32459 > URL: https://issues.apache.org/jira/browse/SPARK-32459 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > > {code:java} > test("WrappedArray") { > val myUdf = udf((a: WrappedArray[Int]) => > WrappedArray.make[Int](Array(a.head + 99))) > checkAnswer(Seq(Array(1)) > .toDF("col") > .select(myUdf(Column("col"))), > Row(ArrayBuffer(100))) > }{code} > Execute the above test in master branch, we'll hit the error: > {code:java} > [info] org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in > stage 0.0 (TID 0, 192.168.101.3, executor driver): > java.lang.RuntimeException: Error while decoding: > java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer cannot be > cast to scala.collection.mutable.WrappedArray[info] {code} > However, the test can be executed successfully in branch-3.0. > > This's actually a regression caused by SPARK-31826. And the regression > happens after we changed the catalyst-to-scala converter from > CatalystTypeConverters to ExpressionEncoder.deserializer . > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32459) UDF regression of WrappedArray supporting caused by SPARK-31826
[ https://issues.apache.org/jira/browse/SPARK-32459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32459: Assignee: (was: Apache Spark) > UDF regression of WrappedArray supporting caused by SPARK-31826 > --- > > Key: SPARK-32459 > URL: https://issues.apache.org/jira/browse/SPARK-32459 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > > {code:java} > test("WrappedArray") { > val myUdf = udf((a: WrappedArray[Int]) => > WrappedArray.make[Int](Array(a.head + 99))) > checkAnswer(Seq(Array(1)) > .toDF("col") > .select(myUdf(Column("col"))), > Row(ArrayBuffer(100))) > }{code} > Execute the above test in master branch, we'll hit the error: > {code:java} > [info] org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in > stage 0.0 (TID 0, 192.168.101.3, executor driver): > java.lang.RuntimeException: Error while decoding: > java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer cannot be > cast to scala.collection.mutable.WrappedArray[info] {code} > However, the test can be executed successfully in branch-3.0. > > This's actually a regression caused by SPARK-31826. And the regression > happens after we changed the catalyst-to-scala converter from > CatalystTypeConverters to ExpressionEncoder.deserializer . > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32459) UDF regression of WrappedArray supporting caused by SPARK-31826
[ https://issues.apache.org/jira/browse/SPARK-32459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165696#comment-17165696 ] Apache Spark commented on SPARK-32459: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/29261 > UDF regression of WrappedArray supporting caused by SPARK-31826 > --- > > Key: SPARK-32459 > URL: https://issues.apache.org/jira/browse/SPARK-32459 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > > {code:java} > test("WrappedArray") { > val myUdf = udf((a: WrappedArray[Int]) => > WrappedArray.make[Int](Array(a.head + 99))) > checkAnswer(Seq(Array(1)) > .toDF("col") > .select(myUdf(Column("col"))), > Row(ArrayBuffer(100))) > }{code} > Execute the above test in master branch, we'll hit the error: > {code:java} > [info] org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in > stage 0.0 (TID 0, 192.168.101.3, executor driver): > java.lang.RuntimeException: Error while decoding: > java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer cannot be > cast to scala.collection.mutable.WrappedArray[info] {code} > However, the test can be executed successfully in branch-3.0. > > This's actually a regression caused by SPARK-31826. And the regression > happens after we changed the catalyst-to-scala converter from > CatalystTypeConverters to ExpressionEncoder.deserializer . > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32459) UDF regression of WrappedArray supporting caused by SPARK-31826
[ https://issues.apache.org/jira/browse/SPARK-32459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32459: Assignee: Apache Spark > UDF regression of WrappedArray supporting caused by SPARK-31826 > --- > > Key: SPARK-32459 > URL: https://issues.apache.org/jira/browse/SPARK-32459 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > > {code:java} > test("WrappedArray") { > val myUdf = udf((a: WrappedArray[Int]) => > WrappedArray.make[Int](Array(a.head + 99))) > checkAnswer(Seq(Array(1)) > .toDF("col") > .select(myUdf(Column("col"))), > Row(ArrayBuffer(100))) > }{code} > Execute the above test in master branch, we'll hit the error: > {code:java} > [info] org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in > stage 0.0 (TID 0, 192.168.101.3, executor driver): > java.lang.RuntimeException: Error while decoding: > java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer cannot be > cast to scala.collection.mutable.WrappedArray[info] {code} > However, the test can be executed successfully in branch-3.0. > > This's actually a regression caused by SPARK-31826. And the regression > happens after we changed the catalyst-to-scala converter from > CatalystTypeConverters to ExpressionEncoder.deserializer . > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32425) Spark sequence() fails if start and end of range are identical timestamps
[ https://issues.apache.org/jira/browse/SPARK-32425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165810#comment-17165810 ] L. C. Hsieh commented on SPARK-32425: - Thanks [~JinxinTang]. > Spark sequence() fails if start and end of range are identical timestamps > - > > Key: SPARK-32425 > URL: https://issues.apache.org/jira/browse/SPARK-32425 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: Spark 3.0 on Databricks Runtime 7.1 >Reporter: Lauri Koobas >Priority: Minor > > The following Spark SQL query throws an exception > {code:java} > select sequence(current_timestamp, current_timestamp) > {code} > The error is: > {noformat} > java.lang.ArrayIndexOutOfBoundsException: 1 > at > scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:78){noformat} > Essentially a copy of https://issues.apache.org/jira/browse/SPARK-31980# -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-32431) The .schema() API behaves incorrectly for nested schemas that have column duplicates in case-insensitive mode
[ https://issues.apache.org/jira/browse/SPARK-32431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-32431: --- Comment: was deleted (was: I cannot reproduce the issue on master, branch-3.0 and branch-2.4. I opened this PR [https://github.com/apache/spark/pull/29234] with tests. [~mswit] Could you help me to reproduce the issue, please.) > The .schema() API behaves incorrectly for nested schemas that have column > duplicates in case-insensitive mode > - > > Key: SPARK-32431 > URL: https://issues.apache.org/jira/browse/SPARK-32431 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 3.0.0 >Reporter: Michał Świtakowski >Priority: Major > > The code below throws org.apache.spark.sql.AnalysisException: Found duplicate > column(s) in the data schema: `camelcase`; for multiple file formats due to a > duplicate column in the requested schema. > {code:java} > import org.apache.spark.sql.types._ > spark.conf.set("spark.sql.caseSensitive", "false") > val formats = Seq("parquet", "orc", "avro", "json") > val caseInsensitiveSchema = new StructType().add("LowerCase", > LongType).add("camelcase", LongType).add("CamelCase", LongType) > formats.map{ format => > val path = s"/tmp/$format" > spark > .range(1L) > .selectExpr("id AS lowercase", "id + 1 AS camelCase") > .write.mode("overwrite").format(format).save(path) > spark.read.schema(caseInsensitiveSchema).format(format).load(path).show > } > {code} > Similar code with nested schema behaves inconsistently across file formats > and sometimes returns incorrect results: > {code:java} > import org.apache.spark.sql.types._ > spark.conf.set("spark.sql.caseSensitive", "false") > val formats = Seq("parquet", "orc", "avro", "json") > val caseInsensitiveSchema = new StructType().add("StructColumn", new > StructType().add("LowerCase", LongType).add("camelcase", > LongType).add("CamelCase", LongType)) > formats.map{ format => > val path = s"/tmp/$format" > spark > 
.range(1L) > .selectExpr("NAMED_STRUCT('lowercase', id, 'camelCase', id + 1) AS > StructColumn") > .write.mode("overwrite").format(format).save(path) > > spark.read.schema(caseInsensitiveSchema).format(format).load(path).show > } > {code} > The desired behavior likely should be returning an exception just like in the > flat schema scenario. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
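The desired behavior, rejecting duplicates in nested structs the same way as in flat schemas, amounts to flattening field paths before running the duplicate check. A hedged plain-Python sketch (the schema encoding and function name are illustrative, not Spark's StructType API):

```python
def find_duplicate_columns(schema, case_sensitive=False):
    # Flatten nested struct fields into dotted paths and report names that
    # collide under the active case-sensitivity mode.
    def flatten(fields, prefix=""):
        for name, subfields in fields:
            path = prefix + name
            yield path
            if subfields:
                yield from flatten(subfields, path + ".")
    seen, dups = set(), []
    for path in flatten(schema):
        key = path if case_sensitive else path.lower()
        if key in seen:
            dups.append(path)
        seen.add(key)
    return dups

# Nested schema from the report: camelcase vs CamelCase inside a struct.
schema = [("StructColumn", [("LowerCase", None), ("camelcase", None), ("CamelCase", None)])]
print(find_duplicate_columns(schema))  # ['StructColumn.CamelCase']
```

With the check applied to flattened paths, the nested case would raise the same "Found duplicate column(s)" error as the flat case, rather than behaving differently per file format.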
[jira] [Assigned] (SPARK-32420) Add handling for unique key in non-codegen hash join
[ https://issues.apache.org/jira/browse/SPARK-32420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32420: --- Assignee: Cheng Su > Add handling for unique key in non-codegen hash join > > > Key: SPARK-32420 > URL: https://issues.apache.org/jira/browse/SPARK-32420 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Trivial > > `HashRelation` has two separate code paths for unique key look up and > non-unique key look up E.g. in its subclass > `UnsafeHashedRelation`([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L144-L177]), > unique key look up is more efficient as it does not have extra > `Iterator[UnsafeRow].hasNext()/next()` overhead per row. > `BroadcastHashJoinExec` has handled unique key vs non-unique key separately > in code-gen path > ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala#L289-L321]). > But the non-codegen path for broadcast hash join and shuffled hash join do > not separate it yet, so adding the support here. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
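The efficiency gap described above can be sketched in plain Python: with non-unique keys every probe returns a collection the caller must iterate, while a unique-key layout maps straight to the row. The names are illustrative, not Spark's HashedRelation API:

```python
from collections import defaultdict

def build_relation(rows, key):
    # Non-unique layout: every key maps to a list, and each probe pays
    # per-row iteration overhead (the hasNext()/next() cost in the JVM).
    rel = defaultdict(list)
    for row in rows:
        rel[key(row)].append(row)
    return dict(rel)

def build_unique_relation(rows, key):
    # Unique-key layout: key maps directly to the single matching row,
    # so a probe is one hash lookup with no iterator overhead.
    rel = {}
    for row in rows:
        k = key(row)
        assert k not in rel, "keys are not unique"
        rel[k] = row
    return rel

rows = [(1, "a"), (2, "b")]
multi = build_relation(rows, lambda r: r[0])
uniq = build_unique_relation(rows, lambda r: r[0])
print(multi[1])  # [(1, 'a')] -- caller iterates the matches
print(uniq[1])   # (1, 'a')   -- direct hit
```

The ticket extends this unique-key fast path, already used in the code-gen path of BroadcastHashJoinExec, to the non-codegen paths of broadcast and shuffled hash join.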
[jira] [Updated] (SPARK-31993) Generated code in 'concat_ws' fails to compile when splitting method is in effect
[ https://issues.apache.org/jira/browse/SPARK-31993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-31993: Component/s: (was: Spark Core) SQL > Generated code in 'concat_ws' fails to compile when splitting method is in > effect > - > > Key: SPARK-31993 > URL: https://issues.apache.org/jira/browse/SPARK-31993 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.6, 3.0.0, 3.1.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.1.0 > > > https://github.com/apache/spark/blob/a0187cd6b59a6b6bb2cadc6711bb663d4d35a844/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L88-L195 > There are three parts of generated code in concat_ws (codes, varargCounts, > varargBuilds) and each part tries to split methods by itself, while > `varargCounts` and `varargBuilds` refer to the generated code in `codes`, > hence the overall generated code fails to compile if any one part gets > split. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32429) Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch
[ https://issues.apache.org/jira/browse/SPARK-32429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165934#comment-17165934 ] Thomas Graves commented on SPARK-32429: --- So this doesn't address the task side; it addresses the executor side. The Worker has a discovery script that just returns an array of strings; as long as the address is something CUDA_VISIBLE_DEVICES understands, it's fine. I would expect this to be configurable so users could turn it on and off if needed. The worker sets it to be how many GPUs it was going to assign to the executor before (via passing in the resources file). Within the executor, we still assign a specific GPU to a task when we launch it, and the task could set it via the CUDA API (cudaSetDevice) if they wish to restrict it. Setting CUDA_VISIBLE_DEVICES ensures that the task doesn't use a GPU that got assigned to another executor. This is essentially what YARN and k8s do today running in a docker container: the container can only see the number of GPUs requested, but it's still up to app code to set it per thread (task) if necessary. Yeah the python side is perhaps more confusing in that respect, but my assumption was to just set CUDA_VISIBLE_DEVICES to be the same as the JVM process. I think that would still allow you to reuse the python processes for different tasks, and it leaves the same constraint on application code to specifically handle per-task setting. Again I think this is the same as a python process spawned inside a yarn/k8s docker container with GPU isolation on. In general I would expect the GPU to really only be used by either the python or the jvm process, but GPUs can handle both using it and context switching as long as they don't both use all the memory. 
Here is a rough prototype for the java side: [https://github.com/tgravescs/spark/commit/8f1a13d5eef82f81ef3c424a9a7b4b47903aab7b] > Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch > - > > Key: SPARK-32429 > URL: https://issues.apache.org/jira/browse/SPARK-32429 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > It would be nice if standalone mode could allow users to set > CUDA_VISIBLE_DEVICES before launching an executor. This has multiple > benefits. > * kind of an isolation in that the executor can only see the GPUs set there. > * If your GPU application doesn't support explicitly setting the GPU device > id, setting this will make any GPU look like the default (id 0) and things > generally just work without any explicit setting > * New features are being added on newer GPUs that require explicit setting > of CUDA_VISIBLE_DEVICES like MIG > ([https://www.nvidia.com/en-us/technologies/multi-instance-gpu/]) > The code changes to just set this are very small, once we set them we would > also possibly need to change the gpu addresses as it changes them to start > from device id 0 again. > The easiest implementation would just specifically support this and have it > behind a config and set when the config is on and GPU resources are > allocated. > Note we probably want to have this same thing set when we launch a python > process as well so that it gets same env. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32424) Fix silent data change for timestamp parsing if overflow happens
[ https://issues.apache.org/jira/browse/SPARK-32424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32424: --- Assignee: Kent Yao > Fix silent data change for timestamp parsing if overflow happens > > > Key: SPARK-32424 > URL: https://issues.apache.org/jira/browse/SPARK-32424 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > the overflow behavior of 3.0 for timestamp parsing has been changed compared > to 2.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32424) Fix silent data change for timestamp parsing if overflow happens
[ https://issues.apache.org/jira/browse/SPARK-32424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32424. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29220 [https://github.com/apache/spark/pull/29220] > Fix silent data change for timestamp parsing if overflow happens > > > Key: SPARK-32424 > URL: https://issues.apache.org/jira/browse/SPARK-32424 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.1.0 > > > the overflow behavior of 3.0 for timestamp parsing has been changed compared > to 2.4 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32332) AQE doesn't adequately allow for Columnar Processing extension
[ https://issues.apache.org/jira/browse/SPARK-32332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165897#comment-17165897 ] Apache Spark commented on SPARK-32332: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/29262 > AQE doesn't adequately allow for Columnar Processing extension > --- > > Key: SPARK-32332 > URL: https://issues.apache.org/jira/browse/SPARK-32332 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > In SPARK-27396 we added support to extended Columnar Processing. We did the > initial work as to what we thought was sufficient but adaptive query > execution was being developed at the same time. > We have discovered that the changes made to AQE are not sufficient for users > to properly extend it for columnar processing because AQE hardcodes to look > for specific classes/execs. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32429) Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch
[ https://issues.apache.org/jira/browse/SPARK-32429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165903#comment-17165903 ] Xiangrui Meng commented on SPARK-32429: --- A couple of questions: 1. Which GPU resource name do we use? "spark.task.resource.gpu" does not have special meaning in the current implementation. 2. I think we can do this for PySpark workers if 1) gets resolved. However, for executors running inside the same JVM, is there a way to set CUDA_VISIBLE_DEVICES differently per executor thread? > Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch > - > > Key: SPARK-32429 > URL: https://issues.apache.org/jira/browse/SPARK-32429 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > It would be nice if standalone mode could allow users to set > CUDA_VISIBLE_DEVICES before launching an executor. This has multiple > benefits. > * kind of an isolation in that the executor can only see the GPUs set there. > * If your GPU application doesn't support explicitly setting the GPU device > id, setting this will make any GPU look like the default (id 0) and things > generally just work without any explicit setting > * New features are being added on newer GPUs that require explicit setting > of CUDA_VISIBLE_DEVICES like MIG > ([https://www.nvidia.com/en-us/technologies/multi-instance-gpu/]) > The code changes to just set this are very small, once we set them we would > also possibly need to change the gpu addresses as it changes them to start > from device id 0 again. > The easiest implementation would just specifically support this and have it > behind a config and set when the config is on and GPU resources are > allocated. > Note we probably want to have this same thing set when we launch a python > process as well so that it gets same env. 
[jira] [Updated] (SPARK-32431) The .schema() API behaves incorrectly for nested schemas that have column duplicates in case-insensitive mode
[ https://issues.apache.org/jira/browse/SPARK-32431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-32431: --- Description: The code below throws org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `camelcase`; for multiple file formats due to a duplicate column in the requested schema. {code:java} import org.apache.spark.sql.types._ spark.conf.set("spark.sql.caseSensitive", "false") val formats = Seq("parquet", "orc", "avro", "json") val caseInsensitiveSchema = new StructType().add("LowerCase", LongType).add("camelcase", LongType).add("CamelCase", LongType) formats.map{ format => val path = s"/tmp/$format" spark .range(1L) .selectExpr("id AS lowercase", "id + 1 AS camelCase") .write.mode("overwrite").format(format).save(path) spark.read.schema(caseInsensitiveSchema).format(format).load(path).show } {code} Similar code with nested schema behaves inconsistently across file formats and sometimes returns incorrect results: {code:java} import org.apache.spark.sql.types._ spark.conf.set("spark.sql.caseSensitive", "false") val formats = Seq("parquet", "orc", "avro", "json") val caseInsensitiveSchema = new StructType().add("StructColumn", new StructType().add("LowerCase", LongType).add("camelcase", LongType).add("CamelCase", LongType)) formats.map{ format => val path = s"/tmp/$format" spark .range(1L) .selectExpr("NAMED_STRUCT('lowercase', id, 'camelCase', id + 1) AS StructColumn") .write.mode("overwrite").format(format).save(path) spark.read.schema(caseInsensitiveSchema).format(format).load(path).show } {code} The desired behavior likely should be returning an exception just like in the flat schema scenario. was: The code below throws org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `camelcase`; for multiple file formats due to a duplicate column in the requested schema. 
{code:java} import org.apache.spark.sql.types._ spark.conf.set("spark.sql.caseSensitive", "false") val formats = Seq("parquet", "orc", "avro", "json") val caseInsensitiveSchema = new StructType().add("LowerCase", LongType).add("camelcase", LongType).add("CamelCase", LongType) formats.map{ format => val path = s"/tmp/$format" spark .range(1L) .selectExpr("id AS lowercase", "id + 1 AS camelCase") .write.mode("overwrite").format(format).save(path) spark.read.schema(caseInsensitiveSchema).format(format).load(path).show } {code} Similar code with nested schema behaves inconsistently across file formats and sometimes returns incorrect results: {code:java} import org.apache.spark.sql.types._ spark.conf.set("spark.sql.caseSensitive", "false")val formats = Seq("parquet", "orc", "avro", "json")val caseInsensitiveSchema = new StructType().add("StructColumn", new StructType().add("LowerCase", LongType).add("camelcase", LongType).add("CamelCase", LongType))formats.map{ format => val path = s"/tmp/$format" spark .range(1L) .selectExpr("NAMED_STRUCT('lowercase', id, 'camelCase', id + 1) AS StructColumn") .write.mode("overwrite").format(format).save(path) spark.read.schema(caseInsensitiveSchema).format(format).load(path).show } {code} The desired behavior likely should be returning an exception just like in the flat schema scenario. > The .schema() API behaves incorrectly for nested schemas that have column > duplicates in case-insensitive mode > - > > Key: SPARK-32431 > URL: https://issues.apache.org/jira/browse/SPARK-32431 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.6, 3.0.0 >Reporter: Michał Świtakowski >Priority: Major > > The code below throws org.apache.spark.sql.AnalysisException: Found duplicate > column(s) in the data schema: `camelcase`; for multiple file formats due to a > duplicate column in the requested schema. 
> {code:java} > import org.apache.spark.sql.types._ > spark.conf.set("spark.sql.caseSensitive", "false") > val formats = Seq("parquet", "orc", "avro", "json") > val caseInsensitiveSchema = new StructType().add("LowerCase", > LongType).add("camelcase", LongType).add("CamelCase", LongType) > formats.map{ format => > val path = s"/tmp/$format" > spark > .range(1L) > .selectExpr("id AS lowercase", "id + 1 AS camelCase") > .write.mode("overwrite").format(format).save(path) > spark.read.schema(caseInsensitiveSchema).format(format).load(path).show > } > {code} > Similar code with nested schema behaves inconsistently across file formats > and sometimes returns incorrect results: > {code:java} > import org.apache.spark.sql.types._ > spark.conf.set("spark.sql.caseSensitive", "false") > val formats =
[jira] [Created] (SPARK-32460) how spark collects non-match results after performing broadcast left outer join
farshad delavarpour created SPARK-32460: --- Summary: how spark collects non-match results after performing broadcast left outer join Key: SPARK-32460 URL: https://issues.apache.org/jira/browse/SPARK-32460 Project: Spark Issue Type: Question Components: Spark Core, SQL Affects Versions: 2.4.0 Reporter: farshad delavarpour Does anybody know how Spark collects non-matching results after performing a broadcast hash left outer join? Suppose we have 4 nodes: 1 driver and 3 executors. We broadcast the left table. After the left outer join is performed in each executor, how does Spark recognize which records have not been matched, and how does it collect them into the final result?
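For background on the question above: in Spark, the broadcast (build) side of a broadcast hash join for a LEFT OUTER join is the right side, and each executor detects non-matching left rows locally during the probe -- a left row whose key misses the hash table is emitted immediately, null-padded on the right, so no cross-executor collection step is needed. A rough, simplified sketch of one executor's share of the work (not Spark's implementation; the tuple row layout and `right_width` parameter are assumptions for illustration):

```python
def broadcast_hash_left_outer_join(left_rows, broadcast_right,
                                   left_key, right_key, right_width=2):
    """One executor's share of a broadcast hash LEFT OUTER join (sketch).

    The smaller right side is broadcast to every executor and hashed
    once; the executor then streams its own left-side partition through
    the table. A left row whose key is absent from the table is known
    to be unmatched locally, so it is emitted at once with nulls for
    the right columns. `right_width` is the number of right columns.
    """
    # Build phase: hash the broadcast (right) side by join key.
    table = {}
    for row in broadcast_right:
        table.setdefault(right_key(row), []).append(row)

    # Probe phase: stream this executor's left partition.
    for lrow in left_rows:
        matches = table.get(left_key(lrow))
        if matches:
            for rrow in matches:
                yield lrow + rrow
        else:
            yield lrow + (None,) * right_width  # null-pad right columns
```

The final result is then just the union of each executor's output partitions; nothing has to be reconciled across executors.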
[jira] [Assigned] (SPARK-32443) Fix testCommandAvailable to use POSIX compatible `command -v`
[ https://issues.apache.org/jira/browse/SPARK-32443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-32443: - Assignee: Hyukjin Kwon (was: Dongjoon Hyun) > Fix testCommandAvailable to use POSIX compatible `command -v` > - > > Key: SPARK-32443 > URL: https://issues.apache.org/jira/browse/SPARK-32443 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > > Currently, we run the command to check the existence. This is dangerous and > doesn't work sometimes. > {code} > scala> sys.process.Process("cat").run().exitValue() > res0: Int = 0 > scala> sys.process.Process("ls").run().exitValue() > LICENSE > NOTICE > bin > doc > lib > man > res1: Int = 0 > scala> sys.process.Process("rm").run().exitValue() > usage: rm [-f | -i] [-dPRrvW] file ... >unlink file > res4: Int = 64 > {code} > {code} > scala> sys.process.Process("command -v rm").run().exitValue() > /bin/rm > res5: Int = 0 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32443) Fix testCommandAvailable to use POSIX compatible `command -v`
[ https://issues.apache.org/jira/browse/SPARK-32443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32443. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29241 [https://github.com/apache/spark/pull/29241] > Fix testCommandAvailable to use POSIX compatible `command -v` > - > > Key: SPARK-32443 > URL: https://issues.apache.org/jira/browse/SPARK-32443 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.1.0 > > > Currently, we run the command to check the existence. This is dangerous and > doesn't work sometimes. > {code} > scala> sys.process.Process("cat").run().exitValue() > res0: Int = 0 > scala> sys.process.Process("ls").run().exitValue() > LICENSE > NOTICE > bin > doc > lib > man > res1: Int = 0 > scala> sys.process.Process("rm").run().exitValue() > usage: rm [-f | -i] [-dPRrvW] file ... >unlink file > res4: Int = 64 > {code} > {code} > scala> sys.process.Process("command -v rm").run().exitValue() > /bin/rm > res5: Int = 0 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
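The same distinction the fix relies on -- resolving a command name versus actually running the command -- can be illustrated outside Scala. A small Python sketch (function names are illustrative; `shutil.which` plays the role of `command -v`):

```python
import shutil
import subprocess

def command_available(name):
    """POSIX-safe existence check that never executes the command.

    Running the command itself (the old approach) can block, print
    usage noise, or exit non-zero even when the command exists (BSD
    `rm` with no args exits 64); resolving the name cannot.
    """
    return shutil.which(name) is not None

def command_available_sh(name):
    # The exact shell form used by the fix: exit status 0 iff found.
    return subprocess.call(
        ["sh", "-c", "command -v " + name + " >/dev/null 2>&1"]) == 0
```

`command` is a POSIX shell built-in, which is why `command -v` is portable where `which` historically is not.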
[jira] [Resolved] (SPARK-32420) Add handling for unique key in non-codegen hash join
[ https://issues.apache.org/jira/browse/SPARK-32420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32420. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29216 [https://github.com/apache/spark/pull/29216] > Add handling for unique key in non-codegen hash join > > > Key: SPARK-32420 > URL: https://issues.apache.org/jira/browse/SPARK-32420 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Trivial > Fix For: 3.1.0 > > > `HashRelation` has two separate code paths for unique key look up and > non-unique key look up E.g. in its subclass > `UnsafeHashedRelation`([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L144-L177]), > unique key look up is more efficient as it does not have extra > `Iterator[UnsafeRow].hasNext()/next()` overhead per row. > `BroadcastHashJoinExec` has handled unique key vs non-unique key separately > in code-gen path > ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala#L289-L321]). > But the non-codegen path for broadcast hash join and shuffled hash join do > not separate it yet, so adding the support here. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
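The unique-key vs. non-unique-key distinction in `HashedRelation` described above boils down to avoiding a per-probe iterator when each key maps to exactly one row. A toy illustration, with plain Python dicts standing in for `UnsafeHashedRelation` (names are illustrative, not Spark's API):

```python
def build_unique(rows, key):
    """Unique-key relation: each key maps to exactly one row, so a
    probe is a single dict lookup with no per-row iterator."""
    return {key(r): r for r in rows}

def build_multi(rows, key):
    """General relation: key -> list of rows; every probe walks a
    list, paying the hasNext()/next()-style overhead per matched row
    that the issue description mentions."""
    table = {}
    for r in rows:
        table.setdefault(key(r), []).append(r)
    return table

def probe_unique(table, k):
    row = table.get(k)
    return [row] if row is not None else []

def probe_multi(table, k):
    return [row for row in table.get(k, ())]
```

Both probes return the same matches; the unique-key path simply skips the per-row iteration, which is the saving the non-codegen hash join paths pick up in this change.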
[jira] [Resolved] (SPARK-32457) logParam thresholds in DT/GBT/FM/LR/MLP
[ https://issues.apache.org/jira/browse/SPARK-32457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao resolved SPARK-32457. Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29257 [https://github.com/apache/spark/pull/29257] > logParam thresholds in DT/GBT/FM/LR/MLP > > > Key: SPARK-32457 > URL: https://issues.apache.org/jira/browse/SPARK-32457 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Trivial > Fix For: 3.1.0 > > > param thresholds is logged in NB/RF, but not in other ProbabilisticClassifier -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32457) logParam thresholds in DT/GBT/FM/LR/MLP
[ https://issues.apache.org/jira/browse/SPARK-32457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao reassigned SPARK-32457: -- Assignee: zhengruifeng > logParam thresholds in DT/GBT/FM/LR/MLP > > > Key: SPARK-32457 > URL: https://issues.apache.org/jira/browse/SPARK-32457 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Trivial > > param thresholds is logged in NB/RF, but not in other ProbabilisticClassifier -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32421) Add code-gen for shuffled hash join
[ https://issues.apache.org/jira/browse/SPARK-32421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Su updated SPARK-32421: - Parent: SPARK-32461 Issue Type: Sub-task (was: Improvement) > Add code-gen for shuffled hash join > --- > > Key: SPARK-32421 > URL: https://issues.apache.org/jira/browse/SPARK-32421 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Trivial > > We added shuffled hash join codegen internally in our fork, and seeing > obvious improvement in benchmark compared to current non-codegen code path. > Creating this Jira to add this support. Shuffled hash join codegen is very > similar to broadcast hash join codegen. So this is a simple change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32462) Don't save the previous search text for datatable
Kousuke Saruta created SPARK-32462: -- Summary: Don't save the previous search text for datatable Key: SPARK-32462 URL: https://issues.apache.org/jira/browse/SPARK-32462 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.1.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta DataTable is used in the stage-page and executors-page for pagination and filtering tasks/executors by search text. In the current implementation, the keyword is saved, so if we visit the stage-page for a job, the previous search text is filled in the textbox and the task table is filtered. I'm sometimes surprised by this behavior as the stage-page lists no tasks because tasks are filtered by the previous search text.
[jira] [Updated] (SPARK-21505) A dynamic join operator to improve the join reliability
[ https://issues.apache.org/jira/browse/SPARK-21505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Su updated SPARK-21505: - Parent: SPARK-32461 Issue Type: Sub-task (was: New Feature) > A dynamic join operator to improve the join reliability > --- > > Key: SPARK-21505 > URL: https://issues.apache.org/jira/browse/SPARK-21505 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0, 2.3.0, 3.0.0 >Reporter: Lin >Priority: Major > Labels: features > > As we know, hash join is more efficient than sort merge join. But today hash > join is not so widely used because it may fail with OutOfMemory (OOM) error > due to limited memory resource, data skew, statistics mis-estimation and so > on. For example, if we apply shuffle hash join on an uneven distributed > dataset, some partitions might be so large that we cannot make a Hash table > for this particular partition causing OOM error. When OOM happens, current > Spark technology will throw an Exception, resulting in job failure. On the > other hand, if sort-merge join is used, there will be shuffle, sorting and > extra spill, causing the degradation of the join. Considering the efficiency > of hash join, we want to propose a fallback mechanism to dynamically use hash > join or sort-merge join at runtime at task level to provide a more reliable > join operation. > This new dynamic join operator internally implements the logic of HashJoin, > Iterator Reconstruct, Sort, and MergeJoin. We show the process of this > dynamic join method as following: > HashJoin: We start from building Hash table on one side of join partitions. > If Hash table is built successfully, it would be the same as the current > ShuffledHashJoin operator. > Sort: If we fail to build Hash table due to the large partition size, we do > SortMergeJoin only on this partition. 
But we need to rebuild the partition first. When OOM > happens, a Hash table corresponding to a partial part of this partition has > been built successfully (e.g. the first 4000 rows of the RDD), and the iterator of > this partition is now pointing to the 4001st row of the partition. We reuse this > hash table to reconstruct the iterator for the first 4000 rows and > concatenate it with the rest of the rows of this partition so that we can rebuild this > partition completely. On this re-built partition, we apply sorting based on > key values. > MergeJoin: After getting two sorted Iterators, we perform a regular merge join > against them and emit the records to downstream operators. > Iterator Reconstruct: BytesToBytesMap has to be spilled to disk to release > the memory for other operators, such as Sort, Join, etc. In addition, it has > to be converted to an Iterator, so that it can be concatenated with the remaining > items in the original iterator that was used to build the hash table. > Meta Data Population: Necessary metadata, such as sorting keys, join type, > etc., has to be populated, so that it can be used for the potential Sort and > MergeJoin operators.
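The HashJoin, Sort, MergeJoin fallback and the "Iterator Reconstruct" step described in this proposal can be sketched as follows. All names are illustrative rather than the proposal's actual API, and `max_rows` stands in for the real memory limit that would trigger the OOM fallback:

```python
def hash_probe(table, probe_rows, key):
    """Probe a prebuilt hash table; emit (build_row, probe_row) pairs."""
    out = []
    for p in probe_rows:
        for b in table.get(key(p), ()):
            out.append((b, p))
    return out

def merge_join(build_sorted, probe_sorted, key):
    """Minimal inner merge join over two key-sorted lists."""
    out, j = [], 0
    for b in build_sorted:
        while j < len(probe_sorted) and key(probe_sorted[j]) < key(b):
            j += 1
        k = j
        while k < len(probe_sorted) and key(probe_sorted[k]) == key(b):
            out.append((b, probe_sorted[k]))
            k += 1
    return out

def adaptive_hash_join(build_iter, probe_rows, key, max_rows):
    """Dynamic join sketch: try a hash join; if the build side exceeds
    the budget (standing in for OOM), fall back to sort-merge join,
    reconstructing already-consumed rows from the partial hash table
    (the 'Iterator Reconstruct' step). `build_iter` must be an iterator.
    """
    table, consumed, fell_back = {}, 0, False
    for row in build_iter:
        table.setdefault(key(row), []).append(row)
        consumed += 1
        if consumed >= max_rows:
            fell_back = True
            break
    if not fell_back:
        return hash_probe(table, probe_rows, key)
    # Rebuild the full build-side partition: rows already hashed plus
    # the rows still left in the iterator, then sort both sides.
    rebuilt = [r for rows in table.values() for r in rows]
    rebuilt.extend(build_iter)  # remaining, unconsumed rows
    rebuilt.sort(key=key)
    return merge_join(rebuilt, sorted(probe_rows, key=key), key)
```

Either path produces the same set of joined pairs; only the order and the memory profile differ, which is the point of falling back at task level rather than failing the job.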
[jira] [Commented] (SPARK-32461) Shuffled hash join improvement
[ https://issues.apache.org/jira/browse/SPARK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165951#comment-17165951 ] Cheng Su commented on SPARK-32461: -- Just FYI - I am working on each sub-task separately now. > Shuffled hash join improvement > -- > > Key: SPARK-32461 > URL: https://issues.apache.org/jira/browse/SPARK-32461 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Minor > > Shuffled hash join avoids the sort compared to sort merge join. This advantage > shows up obviously when joining large tables in terms of saving CPU and IO (in > case an external sort happens). In the latest master trunk, shuffled hash join is > disabled by default with config "spark.sql.join.preferSortMergeJoin"=true, > in favor of reducing the risk of OOM. However shuffled hash join could be > improved to a better state (validated in our internal fork). Creating this > Jira to track overall progress.
[jira] [Commented] (SPARK-32417) Flaky test: BlockManagerDecommissionIntegrationSuite.verify that an already running task which is going to cache data succeeds on a decommissioned executor
[ https://issues.apache.org/jira/browse/SPARK-32417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165979#comment-17165979 ] Apache Spark commented on SPARK-32417: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/29263 > Flaky test: BlockManagerDecommissionIntegrationSuite.verify that an already > running task which is going to cache data succeeds on a decommissioned > executor > --- > > Key: SPARK-32417 > URL: https://issues.apache.org/jira/browse/SPARK-32417 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126424/testReport/ > {code:java} > Error Message > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 2759 times over 30.001772248 > seconds. Last failure message: Map() was empty We should have a block that > has been on multiple BMs in rdds: > ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, > localhost, 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 > replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, > 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0))) > from: > ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, > amp-jenkins-worker-05.amp, 45854, > None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(2, localhost, > 42805, 
None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, > 42041, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, > 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0))) > but instead we got: Map(rdd_1_0 -> 1, rdd_1_2 -> 1, rdd_1_1 -> 1). > Stacktrace > sbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 2759 times over 30.001772248 > seconds. Last failure message: Map() was empty We should have a block that > has been on multiple BMs in rdds: > ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, > localhost, 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 > replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, > 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0))) > from: > ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, > amp-jenkins-worker-05.amp, 45854, > None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(2, localhost, > 42805, None),broadcast_1_piece0,StorageLevel(memory, 1 
replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, > 42041, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, > 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), > SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, > 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0))) > but instead we got: > Map(rdd_1_0 -> 1, rdd_1_2 -> 1, rdd_1_1
[jira] [Updated] (SPARK-32462) Don't save the previous search text for datatable
[ https://issues.apache.org/jira/browse/SPARK-32462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-32462: --- Description: DataTable is used in stage-page and executors-page for pagination and filter tasks/executors by search text. In the current implementation, search text is saved so if we visit stage-page for a job, the previous search text is filled in the textbox and the task table is filtered. I'm sometimes surprised by this behavior as the stage-page lists no tasks because tasks are filtered by the previous search text. was: DataTable is used in stage-page and executors-page for pagination and filter tasks/executors by search text. In the current implementation, the keyword is saved so if we visit stage-page for a job, the previous search text is filled in the textbox and the task table is filtered. I'm sometimes surprised by this behavior as the stage-page lists no tasks because tasks are filtered by the previous search text. > Don't save the previous search text for datatable > - > > Key: SPARK-32462 > URL: https://issues.apache.org/jira/browse/SPARK-32462 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > DataTable is used in stage-page and executors-page for pagination and filter > tasks/executors by search text. > In the current implementation, search text is saved so if we visit stage-page > for a job, the previous search text is filled in the textbox and the task > table is filtered. > I'm sometimes surprised by this behavior as the stage-page lists no tasks > because tasks are filtered by the previous search text. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32383) Preserve hash join (BHJ and SHJ) stream side ordering
[ https://issues.apache.org/jira/browse/SPARK-32383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Su updated SPARK-32383: - Parent: SPARK-32461 Issue Type: Sub-task (was: Improvement) > Preserve hash join (BHJ and SHJ) stream side ordering > - > > Key: SPARK-32383 > URL: https://issues.apache.org/jira/browse/SPARK-32383 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Trivial > Fix For: 3.1.0 > > > Currently `BroadcastHashJoinExec` and `ShuffledHashJoinExec` do not preserve > children output ordering information (inherit from > `SparkPlan.outputOrdering`, which is Nil). This can add unnecessary sort in > complex queries involved multiple joins. > Example: > > {code:java} > withSQLConf( > SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50") { > val df1 = spark.range(100).select($"id".as("k1")) > val df2 = spark.range(100).select($"id".as("k2")) > val df3 = spark.range(3).select($"id".as("k3")) > val df4 = spark.range(100).select($"id".as("k4")) > val plan = df1.join(df2, $"k1" === $"k2") > .join(df3, $"k1" === $"k3") > .join(df4, $"k1" === $"k4") > .queryExecution > .executedPlan > } > {code} > > Current physical plan (extra sort on `k1` before top sort merge join): > {code:java} > *(9) SortMergeJoin [k1#220L], [k4#232L], Inner > :- *(6) Sort [k1#220L ASC NULLS FIRST], false, 0 > : +- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight > : :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner > : : :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0 > : : : +- Exchange hashpartitioning(k1#220L, 5), true, [id=#128] > : : : +- *(1) Project [id#218L AS k1#220L] > : : :+- *(1) Range (0, 100, step=1, splits=2) > : : +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0 > : : +- Exchange hashpartitioning(k2#224L, 5), true, [id=#134] > : :+- *(3) Project [id#222L AS k2#224L] > : : +- *(3) Range (0, 100, step=1, splits=2) > : +- BroadcastExchange 
HashedRelationBroadcastMode(List(input[0, bigint, > false])), [id=#141] > :+- *(5) Project [id#226L AS k3#228L] > : +- *(5) Range (0, 3, step=1, splits=2) > +- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(k4#232L, 5), true, [id=#148] > +- *(7) Project [id#230L AS k4#232L] > +- *(7) Range (0, 100, step=1, splits=2) > {code} > Ideal physical plan (no extra sort on `k1` before top sort merge join): > {code:java} > *(9) SortMergeJoin [k1#220L], [k4#232L], Inner > :- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight > : :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner > : : :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0 > : : : +- Exchange hashpartitioning(k1#220L, 5), true, [id=#127] > : : : +- *(1) Project [id#218L AS k1#220L] > : : :+- *(1) Range (0, 100, step=1, splits=2) > : : +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0 > : : +- Exchange hashpartitioning(k2#224L, 5), true, [id=#133] > : :+- *(3) Project [id#222L AS k2#224L] > : : +- *(3) Range (0, 100, step=1, splits=2) > : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, > false])), [id=#140] > : +- *(5) Project [id#226L AS k3#228L] > :+- *(5) Range (0, 3, step=1, splits=2) > +- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(k4#232L, 5), true, [id=#146] > +- *(7) Project [id#230L AS k4#232L] > +- *(7) Range (0, 100, step=1, splits=2){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
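The mechanism behind the extra sort can be sketched without Spark: a planner-style rule inserts a sort only when the child's reported output ordering fails to satisfy the requirement, so a join node that reports Nil forces a redundant sort even though its stream side is already sorted. A toy Python model of that idea (illustrative names only, not Spark's actual EnsureRequirements API):

```python
# Toy model of sort insertion: a sort is added only when the child's
# reported ordering does not already satisfy the required ordering.
def sorts_added(reported_ordering, required_ordering):
    if reported_ordering[:len(required_ordering)] == required_ordering:
        return 0  # reported ordering satisfies the requirement: no sort
    return 1      # requirement not provably satisfied: wrap child in a Sort

# A hash join inheriting outputOrdering = Nil hides its stream side's order,
# so the downstream sort-merge join re-sorts on k1:
assert sorts_added([], ["k1"]) == 1
# If the join instead forwards the stream side's ordering, the sort is elided:
assert sorts_added(["k1"], ["k1"]) == 0
```

This mirrors why forwarding the stream side's `outputOrdering` from `BroadcastHashJoinExec`/`ShuffledHashJoinExec` removes the extra sort on `k1` in the plans above.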
[jira] [Updated] (SPARK-32420) Add handling for unique key in non-codegen hash join
[ https://issues.apache.org/jira/browse/SPARK-32420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Su updated SPARK-32420: - Parent: SPARK-32461 Issue Type: Sub-task (was: Improvement) > Add handling for unique key in non-codegen hash join > > > Key: SPARK-32420 > URL: https://issues.apache.org/jira/browse/SPARK-32420 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Trivial > Fix For: 3.1.0 > > > `HashedRelation` has two separate code paths for unique key look up and > non-unique key look up. E.g., in its subclass > `UnsafeHashedRelation`([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L144-L177]), > unique key look up is more efficient as it does not have the extra > `Iterator[UnsafeRow].hasNext()/next()` overhead per row. > `BroadcastHashJoinExec` already handles unique and non-unique keys separately > in the code-gen path > ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala#L289-L321]). > But the non-codegen paths for broadcast hash join and shuffled hash join do > not make this distinction yet, so this adds the support there. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
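The two lookup paths can be illustrated with a small stand-in for {{HashedRelation}} (a Python sketch with illustrative names, not Spark's Scala API): the unique-key path hands back the single matching row directly, while the generic path allocates an iterator per probe.

```python
from collections import defaultdict

class ToyHashedRelation:
    def __init__(self, rows, key):
        self.table = defaultdict(list)
        for row in rows:
            self.table[key(row)].append(row)
        # the relation is unique-keyed when no key maps to more than one row
        self.key_is_unique = all(len(v) == 1 for v in self.table.values())

    def get(self, k):
        # generic path: an iterator per probe, with hasNext/next overhead
        return iter(self.table.get(k, ()))

    def get_value(self, k):
        # unique-key path: return the row directly, no iterator allocated
        rows = self.table.get(k)
        return rows[0] if rows else None

rel = ToyHashedRelation([(1, "a"), (2, "b")], key=lambda r: r[0])
assert rel.key_is_unique
assert rel.get_value(1) == (1, "a")
assert list(rel.get(2)) == [(2, "b")]
```

A join operator that knows `key_is_unique` can call the direct-return path per probe instead of draining an iterator, which is the per-row saving the JIRA describes.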
[jira] [Updated] (SPARK-32399) Support full outer join in shuffled hash join and broadcast hash join
[ https://issues.apache.org/jira/browse/SPARK-32399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Su updated SPARK-32399: - Parent: SPARK-32461 Issue Type: Sub-task (was: Improvement) > Support full outer join in shuffled hash join and broadcast hash join > - > > Key: SPARK-32399 > URL: https://issues.apache.org/jira/browse/SPARK-32399 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Minor > > Currently for SQL full outer join, Spark always does a sort merge join no > matter how large the join children are. Inspired by recent discussion > in [https://github.com/apache/spark/pull/29130#discussion_r456502678] and > [https://github.com/apache/spark/pull/29181], I think we can support full > outer join in shuffled hash join and broadcast hash join in a way that, when > looking up stream side keys from the build side {{HashedRelation}}, we mark > the matched rows inside the build side {{HashedRelation}}, and after reading > all rows from the stream side, output all non-matching rows from the build > side based on the modified {{HashedRelation}}. But more design details need > to be figured out for this JIRA. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
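The approach described in the issue — flagging matched build rows while scanning the stream side, then emitting the unmatched build rows at the end — can be sketched outside Spark as follows (a simplified single-machine Python model, not the eventual implementation):

```python
from collections import defaultdict

def full_outer_hash_join(stream, build, key):
    # Build a hash table whose entries carry a "matched" flag per build row.
    table = defaultdict(list)
    for row in build:
        table[key(row)].append([row, False])
    out = []
    for s in stream:
        entries = table.get(key(s), [])
        if entries:
            for e in entries:
                e[1] = True                 # mark the build row as matched
                out.append((s, e[0]))
        else:
            out.append((s, None))           # stream row with no build match
    for entries in table.values():          # finally, emit unmatched build rows
        for row, matched in entries:
            if not matched:
                out.append((None, row))
    return out

# Joining [1, 2] with [2, 3] on the value itself:
assert full_outer_hash_join([1, 2], [2, 3], lambda r: r) == [
    (1, None), (2, 2), (None, 3)]
```

The design question the JIRA leaves open is where the matched flags live inside the real {{HashedRelation}} (which is an off-heap, possibly broadcast structure), not the join logic itself.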
[jira] [Comment Edited] (SPARK-28210) Shuffle Storage API: Reads
[ https://issues.apache.org/jira/browse/SPARK-28210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166027#comment-17166027 ] Tianchen Zhang edited comment on SPARK-28210 at 7/27/20, 11:40 PM: --- Hi [~devaraj], do you mind sharing some ideas about your change? Is it based on the linked PR in this JIRA? We are also finding a way to have our own storage implementation but are blocked by the lack of reader's API. Thanks. was (Author: tianczha): Hi [~devaraj], do you mind share some ideas about your change? Is it based on the linked PR in this JIRA? We are also finding a way to have our own storage implementation but are blocked by the lack of reader's API. Thanks. > Shuffle Storage API: Reads > -- > > Key: SPARK-28210 > URL: https://issues.apache.org/jira/browse/SPARK-28210 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Matt Cheah >Priority: Major > > As part of the effort to store shuffle data in arbitrary places, this issue > tracks implementing an API for reading the shuffle data stored by the write > API. Also ensure that the existing shuffle implementation is refactored to > use the API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28210) Shuffle Storage API: Reads
[ https://issues.apache.org/jira/browse/SPARK-28210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166027#comment-17166027 ] Tianchen Zhang commented on SPARK-28210: Hi [~devaraj], do you mind share some ideas about your change? Is it based on the linked PR in this JIRA? We are also finding a way to have our own storage implementation but are blocked by the lack of reader's API. Thanks. > Shuffle Storage API: Reads > -- > > Key: SPARK-28210 > URL: https://issues.apache.org/jira/browse/SPARK-28210 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Matt Cheah >Priority: Major > > As part of the effort to store shuffle data in arbitrary places, this issue > tracks implementing an API for reading the shuffle data stored by the write > API. Also ensure that the existing shuffle implementation is refactored to > use the API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32461) Shuffled hash join improvement
Cheng Su created SPARK-32461: Summary: Shuffled hash join improvement Key: SPARK-32461 URL: https://issues.apache.org/jira/browse/SPARK-32461 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Cheng Su Shuffled hash join avoids the sort required by sort merge join. This advantage shows up clearly when joining large tables, in terms of saving CPU and IO (in case an external sort happens). In the latest master trunk, shuffled hash join is disabled by default with the config "spark.sql.join.preferSortMergeJoin"=true, in favor of reducing the risk of OOM. However, shuffled hash join could be improved to a better state (validated in our internal fork). Creating this Jira to track overall progress. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
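For context, the settings that currently gate shuffled hash join can be flipped per session. A hedged sketch (config names as of Spark 3.x — verify against your version; assumes an existing SparkSession named `spark`):

```python
# Prefer shuffled hash join over sort merge join where the planner allows it:
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
# Disable auto-broadcast so small relations are not broadcast-joined instead:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
```

Even with these set, the planner only picks shuffled hash join when the build side is estimated small enough to hash, which is part of the OOM risk this umbrella issue tracks.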
[jira] [Updated] (SPARK-32330) Preserve shuffled hash join build side partitioning
[ https://issues.apache.org/jira/browse/SPARK-32330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Su updated SPARK-32330: - Parent: SPARK-32461 Issue Type: Sub-task (was: Improvement) > Preserve shuffled hash join build side partitioning > --- > > Key: SPARK-32330 > URL: https://issues.apache.org/jira/browse/SPARK-32330 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Major > Fix For: 3.1.0 > > > Currently `ShuffledHashJoin.outputPartitioning` inherits from > `HashJoin.outputPartitioning`, which only preserves stream side partitioning: > `HashJoin.scala` > {code:java} > override def outputPartitioning: Partitioning = > streamedPlan.outputPartitioning > {code} > This loses build side partitioning information, and causes extra shuffle if > there's another join / group-by after this join. > Example: > > {code:java} > // code placeholder > withSQLConf( > SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50", > SQLConf.SHUFFLE_PARTITIONS.key -> "2", > SQLConf.PREFER_SORTMERGEJOIN.key -> "false") { > val df1 = spark.range(10).select($"id".as("k1")) > val df2 = spark.range(30).select($"id".as("k2")) > Seq("inner", "cross").foreach(joinType => { > val plan = df1.join(df2, $"k1" === $"k2", joinType).groupBy($"k1").count() > .queryExecution.executedPlan > assert(plan.collect { case _: ShuffledHashJoinExec => true }.size === 1) > // No extra shuffle before aggregate > assert(plan.collect { case _: ShuffleExchangeExec => true }.size === 2) > }) > }{code} > > Current physical plan (having an extra shuffle on `k1` before aggregate) > > {code:java} > *(4) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, > count#235L]) > +- Exchange hashpartitioning(k1#220L, 2), true, [id=#117] >+- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], > output=[k1#220L, count#239L]) > +- *(3) Project [k1#220L] > +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, 
BuildLeft > :- Exchange hashpartitioning(k1#220L, 2), true, [id=#109] > : +- *(1) Project [id#218L AS k1#220L] > : +- *(1) Range (0, 10, step=1, splits=2) > +- Exchange hashpartitioning(k2#224L, 2), true, [id=#111] >+- *(2) Project [id#222L AS k2#224L] > +- *(2) Range (0, 30, step=1, splits=2){code} > > Ideal physical plan (no shuffle on `k1` before aggregate) > {code:java} > *(3) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, > count#235L]) > +- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], > output=[k1#220L, count#239L]) >+- *(3) Project [k1#220L] > +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft > :- Exchange hashpartitioning(k1#220L, 2), true, [id=#107] > : +- *(1) Project [id#218L AS k1#220L] > : +- *(1) Range (0, 10, step=1, splits=2) > +- Exchange hashpartitioning(k2#224L, 2), true, [id=#109] > +- *(2) Project [id#222L AS k2#224L] >+- *(2) Range (0, 30, step=1, splits=2){code} > > This can be fixed by overriding `outputPartitioning` method in > `ShuffledHashJoinExec`, similar to `SortMergeJoinExec`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
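The effect can be modeled compactly: an exchange is inserted when none of the partitionings the join reports satisfies the keys required downstream. A toy sketch (not Spark's {{Partitioning}} API) of why reporting only the stream side's partitioning forces the extra shuffle, while reporting a collection covering both sides — as {{SortMergeJoinExec}} does — avoids it:

```python
# Toy model: an exchange is needed when none of the partitionings the join
# reports satisfies the keys required by the downstream operator.
def needs_exchange(reported_partitionings, required_keys):
    return required_keys not in reported_partitionings

# ShuffledHashJoin reporting only the stream side (k2): a group-by on the
# build-side key k1 cannot reuse the distribution, so it reshuffles.
assert needs_exchange([("k2",)], ("k1",)) is True
# Reporting both join sides' partitionings: the group-by on k1 is satisfied.
assert needs_exchange([("k1",), ("k2",)], ("k1",)) is False
```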
[jira] [Updated] (SPARK-32286) Coalesce bucketed tables for shuffled hash join if applicable
[ https://issues.apache.org/jira/browse/SPARK-32286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Su updated SPARK-32286: - Parent: SPARK-32461 Issue Type: Sub-task (was: Improvement) > Coalesce bucketed tables for shuffled hash join if applicable > - > > Key: SPARK-32286 > URL: https://issues.apache.org/jira/browse/SPARK-32286 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Trivial > Fix For: 3.1.0 > > > Based on a follow up comment in PR > [#28123|https://github.com/apache/spark/pull/28123], where we can coalesce > buckets for shuffled hash join as well. The note here is we only coalesce the > buckets from shuffled hash join stream side (i.e. the side not building hash > map), so we don't need to worry about OOM when coalescing multiple buckets in > one task for building hash map. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32462) Don't save the previous search text for datatable
[ https://issues.apache.org/jira/browse/SPARK-32462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32462: Assignee: Kousuke Saruta (was: Apache Spark) > Don't save the previous search text for datatable > - > > Key: SPARK-32462 > URL: https://issues.apache.org/jira/browse/SPARK-32462 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > DataTable is used in stage-page and executors-page for pagination and filter > tasks/executors by search text. > In the current implementation, search text is saved so if we visit stage-page > for a job, the previous search text is filled in the textbox and the task > table is filtered. > I'm sometimes surprised by this behavior as the stage-page lists no tasks > because tasks are filtered by the previous search text. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32462) Don't save the previous search text for datatable
[ https://issues.apache.org/jira/browse/SPARK-32462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166025#comment-17166025 ] Apache Spark commented on SPARK-32462: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/29265 > Don't save the previous search text for datatable > - > > Key: SPARK-32462 > URL: https://issues.apache.org/jira/browse/SPARK-32462 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > DataTable is used in stage-page and executors-page for pagination and filter > tasks/executors by search text. > In the current implementation, search text is saved so if we visit stage-page > for a job, the previous search text is filled in the textbox and the task > table is filtered. > I'm sometimes surprised by this behavior as the stage-page lists no tasks > because tasks are filtered by the previous search text. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32462) Don't save the previous search text for datatable
[ https://issues.apache.org/jira/browse/SPARK-32462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32462: Assignee: Apache Spark (was: Kousuke Saruta) > Don't save the previous search text for datatable > - > > Key: SPARK-32462 > URL: https://issues.apache.org/jira/browse/SPARK-32462 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Minor > > DataTable is used in stage-page and executors-page for pagination and filter > tasks/executors by search text. > In the current implementation, search text is saved so if we visit stage-page > for a job, the previous search text is filled in the textbox and the task > table is filtered. > I'm sometimes surprised by this behavior as the stage-page lists no tasks > because tasks are filtered by the previous search text. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31753) Add missing keywords in the SQL documents
[ https://issues.apache.org/jira/browse/SPARK-31753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-31753: - Affects Version/s: (was: 3.0.0) 3.0.1 > Add missing keywords in the SQL documents > - > > Key: SPARK-31753 > URL: https://issues.apache.org/jira/browse/SPARK-31753 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > Some keywords are missing in the SQL documents and a list of them is as > follows. > [https://github.com/apache/spark/pull/28290#issuecomment-619321301] > {code:java} > AFTER > CASE/ELSE > WHEN/THEN > IGNORE NULLS > LATERAL VIEW (OUTER)? > MAP KEYS TERMINATED BY > NULL DEFINED AS > LINES TERMINATED BY > ESCAPED BY > COLLECTION ITEMS TERMINATED BY > EXPLAIN LOGICAL > PIVOT > {code} > They should be documented there, too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31753) Add missing keywords in the SQL documents
[ https://issues.apache.org/jira/browse/SPARK-31753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-31753: - Affects Version/s: 3.0.0 > Add missing keywords in the SQL documents > - > > Key: SPARK-31753 > URL: https://issues.apache.org/jira/browse/SPARK-31753 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > Some keywords are missing in the SQL documents and a list of them is as > follows. > [https://github.com/apache/spark/pull/28290#issuecomment-619321301] > {code:java} > AFTER > CASE/ELSE > WHEN/THEN > IGNORE NULLS > LATERAL VIEW (OUTER)? > MAP KEYS TERMINATED BY > NULL DEFINED AS > LINES TERMINATED BY > ESCAPED BY > COLLECTION ITEMS TERMINATED BY > EXPLAIN LOGICAL > PIVOT > {code} > They should be documented there, too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31753) Add missing keywords in the SQL documents
[ https://issues.apache.org/jira/browse/SPARK-31753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-31753. -- Fix Version/s: 3.0.1 Assignee: philipse Resolution: Fixed Resolved by [https://github.com/apache/spark/pull/29056] > Add missing keywords in the SQL documents > - > > Key: SPARK-31753 > URL: https://issues.apache.org/jira/browse/SPARK-31753 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Takeshi Yamamuro >Assignee: philipse >Priority: Minor > Fix For: 3.0.1 > > > Some keywords are missing in the SQL documents and a list of them is as > follows. > [https://github.com/apache/spark/pull/28290#issuecomment-619321301] > {code:java} > AFTER > CASE/ELSE > WHEN/THEN > IGNORE NULLS > LATERAL VIEW (OUTER)? > MAP KEYS TERMINATED BY > NULL DEFINED AS > LINES TERMINATED BY > ESCAPED BY > COLLECTION ITEMS TERMINATED BY > EXPLAIN LOGICAL > PIVOT > {code} > They should be documented there, too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32463) Document Data Type inference rule in SQL reference
Huaxin Gao created SPARK-32463: -- Summary: Document Data Type inference rule in SQL reference Key: SPARK-32463 URL: https://issues.apache.org/jira/browse/SPARK-32463 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 3.1.0 Reporter: Huaxin Gao Document Data Type inference rule in SQL reference, under Data Types section. Please see this PR https://github.com/apache/spark/pull/28896 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32464) Support skew handling on join with one side that has no query stage
Wang, Gang created SPARK-32464: -- Summary: Support skew handling on join with one side that has no query stage Key: SPARK-32464 URL: https://issues.apache.org/jira/browse/SPARK-32464 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wang, Gang In our production environment, there are many bucket tables, which are used to join with other tables. And there are some skewed joins now and then. However, in the current implementation, skew join handling can only be applied when both sides of a SMJ are QueryStages, so it is not able to deal with such cases. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32439) Override datasource implementation during look up via configuration
[ https://issues.apache.org/jira/browse/SPARK-32439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32439. -- Resolution: Won't Fix > Override datasource implementation during look up via configuration > --- > > Key: SPARK-32439 > URL: https://issues.apache.org/jira/browse/SPARK-32439 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Minor > > We need a mechanism to override the datasource implementation via > configuration. > For example, suppose I have a custom CSV datasource implementation called > "my_csv". One way to use it is: > {code} > val df = spark.read.format("my_csv").load(...) > {code} > Since the source data is the same format (CSV), you should be able to > override the default implementation. > One proposal is to do the following: > {code} > spark.conf.set("spark.sql.datasource.override.csv", "my_csv") val df = > spark.read.csv(...) > {code} > This has a benefit that the user doesn't have to change the application code > to try out a new datasource implementation for the same source format. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
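One way the proposed mechanism could behave is an override check ahead of the normal registry lookup. A hypothetical Python sketch — the conf key follows the proposal in the issue, the registry shape is illustrative, and (per the Won't Fix resolution) this was never implemented:

```python
def lookup_data_source(name, conf, registry):
    # Consult the proposed override conf first, then fall back to the
    # normal short-name registry lookup.
    override = conf.get("spark.sql.datasource.override." + name)
    return registry[override if override else name]

registry = {"csv": "BuiltinCsvSource", "my_csv": "MyCsvSource"}
# Without the override conf, "csv" resolves to the built-in implementation:
assert lookup_data_source("csv", {}, registry) == "BuiltinCsvSource"
# With the override set, spark.read.csv(...) would transparently use my_csv:
conf = {"spark.sql.datasource.override.csv": "my_csv"}
assert lookup_data_source("csv", conf, registry) == "MyCsvSource"
```

This captures the stated benefit: application code keeps calling `spark.read.csv(...)` while configuration swaps the implementation.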
[jira] [Commented] (SPARK-32361) Remove project if output is subset of child
[ https://issues.apache.org/jira/browse/SPARK-32361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166094#comment-17166094 ] Hyukjin Kwon commented on SPARK-32361: -- Please fill JIRA description. > Remove project if output is subset of child > --- > > Key: SPARK-32361 > URL: https://issues.apache.org/jira/browse/SPARK-32361 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32369) pyspark foreach/foreachPartition send http request failed
[ https://issues.apache.org/jira/browse/SPARK-32369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32369: - Description: I use urllib.request to send HTTP requests in foreach/foreachPartition. PySpark throws the following error:
{code}
objc[74094]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
20/07/20 19:05:58 ERROR Executor: Exception in task 7.0 in stage 0.0 (TID 7)
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
 at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:536)
 at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:525)
 at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
 at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:643)
 at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:621)
 at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
 at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
 at scala.collection.Iterator.foreach(Iterator.scala:941)
 at scala.collection.Iterator.foreach$(Iterator.scala:941)
 at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
 at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
 at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
 at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
 at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
 at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
 at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
 at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
 at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
 at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
 at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
 at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
 at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
 at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
 at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1004)
 at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2133)
{code}
This happens when I call rdd.foreach(send_http), with rdd = sc.parallelize(["http://192.168.1.1:5000/index.html"]) and send_http defined as follows:
{code}
def send_http(url):
    req = urllib.request.Request(url)
    resp = urllib.request.urlopen(req)
{code}
Can anyone tell me the problem? Thanks. was: I use urllib.request to send http request in foreach/foreachPartition. pyspark throw error as follow:I use urllib.request to send http request in foreach/foreachPartition. pyspark throw error as follow: _objc[74094]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. 
Set a breakpoint on objc_initializeAfterForkError to debug.20/07/20 19:05:58 ERROR Executor: Exception in task 7.0 in stage 0.0 (TID 7)org.apache.spark.SparkException: Python worker exited unexpectedly (crashed) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:536) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:525) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:643) at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:621) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62) at
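For reference, the reporter's snippet is fine as plain Python; the `objc ... fork()` message above is typically the macOS fork-safety check killing the forked Python worker, commonly worked around by exporting OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES, rather than a bug in the HTTP code. A self-contained reproduction of the HTTP part against a local stand-in server (the target URL in the report is replaced with a loopback server here):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class OkHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")
    def log_message(self, fmt, *args):
        pass  # keep request logging quiet

def send_http(url):
    # Build the request inside the function so nothing unpicklable is
    # captured in the closure shipped to executors.
    req = urllib.request.Request(url)
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status

# Loopback stand-in for the 192.168.1.1:5000 endpoint in the report:
server = HTTPServer(("127.0.0.1", 0), OkHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/index.html" % server.server_port
status = send_http(url)
server.shutdown()
assert status == 200
```

Under Spark the same `send_http` would be passed to `rdd.foreach(send_http)`; the crash environment variable, not the function, is the assumed fix here.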
[jira] [Commented] (SPARK-32359) Implement max_error metric evaluator for spark regression mllib
[ https://issues.apache.org/jira/browse/SPARK-32359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166095#comment-17166095 ] Hyukjin Kwon commented on SPARK-32359: -- Please fill JIRA description. > Implement max_error metric evaluator for spark regression mllib > --- > > Key: SPARK-32359 > URL: https://issues.apache.org/jira/browse/SPARK-32359 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 3.0.0 >Reporter: Sayed Mohammad Hossein Torabi >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32423) class 'DataFrame' returns instance of type(self) instead of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-32423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166114#comment-17166114 ] Hyukjin Kwon commented on SPARK-32423: -- Can you show some pseudocode? It's a bit difficult to follow what the JIRA targets. > class 'DataFrame' returns instance of type(self) instead of DataFrame > -- > > Key: SPARK-32423 > URL: https://issues.apache.org/jira/browse/SPARK-32423 > Project: Spark > Issue Type: Wish > Components: PySpark >Affects Versions: 2.4.6, 3.0.0 >Reporter: Timothy >Priority: Minor > > To allow for appropriate child classing of DataFrame, I propose the following > change: > class 'DataFrame' returns an instance of type(self) instead of DataFrame > > Therefore child classes using methods such as '.limit()' will return an > instance of the child class. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
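The proposal can be shown with a plain-Python sketch (simplified stand-ins, not PySpark's actual DataFrame): returning `type(self)(...)` keeps a caller's subclass through chained methods, where a hardcoded `DataFrame(...)` would silently drop it.

```python
class DataFrame:
    def __init__(self, rows):
        self.rows = rows
    def limit(self, n):
        # Returning type(self)(...) rather than DataFrame(...) preserves
        # the caller's subclass through chained transformations.
        return type(self)(self.rows[:n])

class AuditedFrame(DataFrame):
    def audit_tag(self):
        return "audited"

df = AuditedFrame([1, 2, 3]).limit(2)
assert isinstance(df, AuditedFrame)   # subclass survives .limit()
assert df.audit_tag() == "audited"
assert df.rows == [1, 2]
```

In PySpark today the equivalent chain would hand back a plain `DataFrame`, which is exactly what the wish asks to change.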
[jira] [Commented] (SPARK-32463) Document Data Type inference rule in SQL reference
[ https://issues.apache.org/jira/browse/SPARK-32463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166045#comment-17166045 ] Huaxin Gao commented on SPARK-32463: [~planga82] You are welcome to work on this if you have free time. Thanks! > Document Data Type inference rule in SQL reference > -- > > Key: SPARK-32463 > URL: https://issues.apache.org/jira/browse/SPARK-32463 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Minor > > Document Data Type inference rule in SQL reference, under Data Types section. > Please see this PR https://github.com/apache/spark/pull/28896
[jira] [Updated] (SPARK-32464) Support skew handling on join that has one side with no query stage
[ https://issues.apache.org/jira/browse/SPARK-32464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wang, Gang updated SPARK-32464: --- Summary: Support skew handling on join that has one side with no query stage (was: Support skew handling on join with one side that has no query stage) > Support skew handling on join that has one side with no query stage > --- > > Key: SPARK-32464 > URL: https://issues.apache.org/jira/browse/SPARK-32464 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wang, Gang >Priority: Major > > In our production environment, there are many bucket tables, which are used > to join with other tables. And there are some skewed joins now and then. > While, in the current implementation, skew join handling can only be applied > when both sides of a SMJ are QueryStages. So skew join handling is not able > to deal with such cases.
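For context, adaptive skew-join handling in Spark 3.x is governed by a small set of configurations; the limitation described in this ticket applies even with them enabled, because the rewrite only fires when both sides of the sort-merge join are materialized query stages. A sketch of the relevant knobs (names are the Spark 3.x AQE configuration keys; values here are illustrative, not defaults):

```python
# Illustrative sketch: AQE configurations that govern skew-join handling
# in Spark 3.x. Bucket-table scans bypass the shuffle stage, which is why
# the existing rewrite cannot see them as query stages.
skew_join_confs = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}

# These would normally be passed to the session builder, e.g.
#   builder = builder.config(key, value)  for each pair.
for key, value in skew_join_confs.items():
    print(f"{key}={value}")
```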
[jira] [Assigned] (SPARK-32464) Support skew handling on join that has one side with no query stage
[ https://issues.apache.org/jira/browse/SPARK-32464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32464: Assignee: (was: Apache Spark) > Support skew handling on join that has one side with no query stage > --- > > Key: SPARK-32464 > URL: https://issues.apache.org/jira/browse/SPARK-32464 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wang, Gang >Priority: Major > > In our production environment, there are many bucket tables, which are used > to join with other tables. And there are some skewed joins now and then. > While, in the current implementation, skew join handling can only be applied > when both sides of a SMJ are QueryStages. So skew join handling is not able > to deal with such cases.
[jira] [Assigned] (SPARK-32464) Support skew handling on join that has one side with no query stage
[ https://issues.apache.org/jira/browse/SPARK-32464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32464: Assignee: Apache Spark > Support skew handling on join that has one side with no query stage > --- > > Key: SPARK-32464 > URL: https://issues.apache.org/jira/browse/SPARK-32464 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wang, Gang >Assignee: Apache Spark >Priority: Major > > In our production environment, there are many bucket tables, which are used > to join with other tables. And there are some skewed joins now and then. > While, in the current implementation, skew join handling can only be applied > when both sides of a SMJ are QueryStages. So skew join handling is not able > to deal with such cases.