[jira] [Commented] (SPARK-29302) dynamic partition overwrite with speculation enabled

2020-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165653#comment-17165653
 ] 

Apache Spark commented on SPARK-29302:
--

User 'WinkerDu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29260

> dynamic partition overwrite with speculation enabled
> 
>
> Key: SPARK-29302
> URL: https://issues.apache.org/jira/browse/SPARK-29302
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: feiwang
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> Currently, for a dynamic partition overwrite operation, the filename of a task's 
> output is deterministic.
> So, if speculation is enabled, could a task conflict with its corresponding 
> speculative task? Would the two tasks write to the same file concurrently?
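> A minimal sketch of the setup under which the question arises (the job itself is 
> only illustrative; the two settings are the standard Spark configuration keys):
> {code:java}
> import org.apache.spark.sql.SparkSession
>
> // Dynamic partition overwrite together with speculative execution enabled.
> val spark = SparkSession.builder()
>   .appName("dynamic-overwrite-with-speculation")
>   .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
>   .config("spark.speculation", "true")
>   .getOrCreate()
> import spark.implicits._
>
> Seq((1, "2020-07-27"), (2, "2020-07-28")).toDF("id", "dt")
>   .write.mode("overwrite")
>   .partitionBy("dt")
>   .parquet("/tmp/target_table")  // a speculative attempt of the same task would
>                                  // aim at the same deterministic staging file name
> {code}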






[jira] [Assigned] (SPARK-32457) logParam thresholds in DT/GBT/FM/LR/MLP

2020-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32457:


Assignee: Apache Spark

> logParam thresholds in  DT/GBT/FM/LR/MLP
> 
>
> Key: SPARK-32457
> URL: https://issues.apache.org/jira/browse/SPARK-32457
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Trivial
>
> The param thresholds is logged in NB/RF, but not in the other ProbabilisticClassifier implementations.






[jira] [Assigned] (SPARK-30794) Stage Level scheduling: Add ability to set off heap memory

2020-07-27 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-30794:
-

Assignee: Zhongwei Zhu

> Stage Level scheduling: Add ability to set off heap memory
> --
>
> Key: SPARK-30794
> URL: https://issues.apache.org/jira/browse/SPARK-30794
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Assignee: Zhongwei Zhu
>Priority: Major
>
> For stage level scheduling in ExecutorResourceRequests we support setting 
> heap memory, pyspark memory, and memory overhead. We have not split out off-heap 
> memory as its own configuration, so we should add it as an option.
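> A minimal sketch of how the new option might sit next to the existing setters 
> (the offHeapMemory(...) name is an assumption modeled on the current 
> memory/memoryOverhead/pysparkMemory setters, not necessarily the final API):
> {code:java}
> import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder}
>
> val ereqs = new ExecutorResourceRequests()
>   .memory("4g")          // executor heap memory (already supported)
>   .memoryOverhead("1g")  // memory overhead (already supported)
>   .pysparkMemory("2g")   // pyspark memory (already supported)
>   .offHeapMemory("2g")   // proposed: off-heap memory as its own request
>
> val profile = new ResourceProfileBuilder().require(ereqs).build()
> // rdd.withResources(profile) would then apply the profile at stage level.
> {code}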






[jira] [Resolved] (SPARK-30794) Stage Level scheduling: Add ability to set off heap memory

2020-07-27 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-30794.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

> Stage Level scheduling: Add ability to set off heap memory
> --
>
> Key: SPARK-30794
> URL: https://issues.apache.org/jira/browse/SPARK-30794
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Assignee: Zhongwei Zhu
>Priority: Major
> Fix For: 3.1.0
>
>
> For stage level scheduling in ExecutorResourceRequests we support setting 
> heap memory, pyspark memory, and memory overhead. We have not split out off-heap 
> memory as its own configuration, so we should add it as an option.






[jira] [Commented] (SPARK-29918) RecordBinaryComparator should check endianness when compared by long

2020-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165611#comment-17165611
 ] 

Apache Spark commented on SPARK-29918:
--

User 'mundaym' has created a pull request for this issue:
https://github.com/apache/spark/pull/29259

> RecordBinaryComparator should check endianness when compared by long
> 
>
> Key: SPARK-29918
> URL: https://issues.apache.org/jira/browse/SPARK-29918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
>Reporter: EdisonWang
>Assignee: EdisonWang
>Priority: Minor
>  Labels: correctness
> Fix For: 2.4.5, 3.0.0
>
>
> If the architecture supports unaligned access or the offset is 8-byte aligned, 
> RecordBinaryComparator compares 8 bytes at a time by reading them as a long. 
> Otherwise, it compares byte by byte. 
> However, on a little-endian machine, the result of comparing by long values and 
> the result of comparing byte by byte may differ. If the architectures in a YARN 
> cluster are mixed (some are unaligned-access capable while others are not), then 
> the order of two records after sorting is nondeterministic, which results 
> in the same problem as in https://issues.apache.org/jira/browse/SPARK-23207
>  
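> A small JDK-only sketch (not the RecordBinaryComparator code itself) of how a 
> long-based comparison can disagree with a byte-by-byte comparison on a 
> little-endian machine:
> {code:java}
> import java.nio.{ByteBuffer, ByteOrder}
>
> // Two 8-byte records whose byte-wise order differs from their order when both
> // are interpreted as little-endian longs.
> val a = Array[Byte](0, 0, 0, 0, 0, 0, 0, 1)
> val b = Array[Byte](1, 0, 0, 0, 0, 0, 0, 0)
>
> def asLong(bytes: Array[Byte], order: ByteOrder): Long =
>   ByteBuffer.wrap(bytes).order(order).getLong
>
> def compareByteWise(x: Array[Byte], y: Array[Byte]): Int = {
>   var i = 0
>   while (i < x.length) {
>     val c = (x(i) & 0xff) - (y(i) & 0xff)  // unsigned, one byte at a time
>     if (c != 0) return c
>     i += 1
>   }
>   0
> }
>
> compareByteWise(a, b)                                                     // < 0: a sorts before b
> java.lang.Long.compareUnsigned(
>   asLong(a, ByteOrder.LITTLE_ENDIAN), asLong(b, ByteOrder.LITTLE_ENDIAN)) // > 0: a sorts after b
> java.lang.Long.compareUnsigned(
>   asLong(a, ByteOrder.BIG_ENDIAN), asLong(b, ByteOrder.BIG_ENDIAN))       // < 0: agrees with byte-wise
> {code}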






[jira] [Created] (SPARK-32459) UDF regression of WrappedArray supporting caused by SPARK-31826

2020-07-27 Thread wuyi (Jira)
wuyi created SPARK-32459:


 Summary: UDF regression of WrappedArray supporting caused by 
SPARK-31826
 Key: SPARK-32459
 URL: https://issues.apache.org/jira/browse/SPARK-32459
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: wuyi


 
{code:java}
test("WrappedArray") {
  val myUdf = udf((a: WrappedArray[Int]) =>
WrappedArray.make[Int](Array(a.head + 99)))
  checkAnswer(Seq(Array(1))
.toDF("col")
.select(myUdf(Column("col"))),
Row(ArrayBuffer(100)))
}{code}
Executing the above test on the master branch, we hit the following error:


{code:java}
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
0.0 (TID 0, 192.168.101.3, executor driver): java.lang.RuntimeException: Error 
while decoding: java.lang.ClassCastException: 
scala.collection.mutable.ArrayBuffer cannot be cast to 
scala.collection.mutable.WrappedArray[info]  {code}
However, the test can be executed successfully in branch-3.0.

 

This is actually a regression caused by SPARK-31826. The regression appeared 
after we changed the catalyst-to-Scala converter from CatalystTypeConverters to 
ExpressionEncoder.deserializer.

 






[jira] [Updated] (SPARK-32458) Mismatched row access sizes in tests

2020-07-27 Thread Michael Munday (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Munday updated SPARK-32458:
---
Description: 
The RowEncoderSuite and UnsafeMapSuite tests fail on big-endian systems. This 
is because the test data is written into the row using unsafe operations with 
one size and then read back using a different size. For example, in 
UnsafeMapSuite the test data is written using putLong and then read back using 
getInt. This happens to work on little-endian systems but these differences 
appear to be typos and cause the tests to fail on big-endian systems.

I have a patch that fixes the issue.

  was:
The RowEncoderSuite and UnsafeMapSuite tests fail on big-endian systems. This 
is because the test data is written into the row using unsafe operations with 
one size and then read back using a different size. For example, in 
UnsafeMapSuite the test data is written using putLong and then read back using 
getInt. This happens to work on little-endian systems but these differences 
appear to be typos and causes the tests to fail on big-endian systems.

I have a patch that fixes the issue.


> Mismatched row access sizes in tests
> 
>
> Key: SPARK-32458
> URL: https://issues.apache.org/jira/browse/SPARK-32458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Michael Munday
>Priority: Minor
>  Labels: catalyst, endianness
>
> The RowEncoderSuite and UnsafeMapSuite tests fail on big-endian systems. This 
> is because the test data is written into the row using unsafe operations with 
> one size and then read back using a different size. For example, in 
> UnsafeMapSuite the test data is written using putLong and then read back 
> using getInt. This happens to work on little-endian systems but these 
> differences appear to be typos and cause the tests to fail on big-endian 
> systems.
> I have a patch that fixes the issue.
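> A JDK-only sketch (ByteBuffer rather than the suites' unsafe helpers) of why a 
> putLong followed by a getInt at the same offset happens to work on little-endian 
> systems but not on big-endian ones:
> {code:java}
> import java.nio.{ByteBuffer, ByteOrder}
>
> // Write an 8-byte long, then read it back with a 4-byte access at the same
> // offset -- the mismatched-size pattern described above.
> def writeLongReadInt(order: ByteOrder): Int = {
>   val buf = ByteBuffer.allocate(8).order(order)
>   buf.putLong(0, 42L)  // written with an 8-byte access
>   buf.getInt(0)        // read back with a 4-byte access
> }
>
> writeLongReadInt(ByteOrder.LITTLE_ENDIAN)  // 42: the low-order bytes come first
> writeLongReadInt(ByteOrder.BIG_ENDIAN)     // 0:  the first 4 bytes hold the high half
> {code}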






[jira] [Created] (SPARK-32458) Mismatched row access sizes in tests

2020-07-27 Thread Michael Munday (Jira)
Michael Munday created SPARK-32458:
--

 Summary: Mismatched row access sizes in tests
 Key: SPARK-32458
 URL: https://issues.apache.org/jira/browse/SPARK-32458
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Michael Munday


The RowEncoderSuite and UnsafeMapSuite tests fail on big-endian systems. This 
is because the test data is written into the row using unsafe operations with 
one size and then read back using a different size. For example, in 
UnsafeMapSuite the test data is written using putLong and then read back using 
getInt. This happens to work on little-endian systems but these differences 
appear to be typos and cause the tests to fail on big-endian systems.

I have a patch that fixes the issue.






[jira] [Commented] (SPARK-19169) columns changed orc table encouter 'IndexOutOfBoundsException' when read the old schema files

2020-07-27 Thread bianqi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165657#comment-17165657
 ] 

bianqi commented on SPARK-19169:


[~hyukjin.kwon] Hello, we also encountered this problem in our production 
environment. 

 
{quote}java.lang.IndexOutOfBoundsException: toIndex = 63 at 
java.util.ArrayList.subListRangeCheck(ArrayList.java:1004) at 
java.util.ArrayList.subList(ArrayList.java:996) at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
 at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
 at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.(RecordReaderImpl.java:202) 
at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:541) 
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.(OrcRawRecordMerger.java:183)
 at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.(OrcRawRecordMerger.java:226)
 at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.(OrcRawRecordMerger.java:437)
 at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1273)
 at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1170)
 at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257) 
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:256) at 
org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214) at 
org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at 
org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
org.apache.spark.scheduler.Task.run(Task.scala:109) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748){quote}

> columns changed orc table encouter 'IndexOutOfBoundsException' when read the 
> old schema files
> -
>
> Key: SPARK-19169
> URL: https://issues.apache.org/jira/browse/SPARK-19169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: roncenzhao
>Priority: Major
>
> We have an ORC table called orc_test_tbl and have inserted some data into it.
> After that, we changed the table schema by dropping some columns.
> When reading an old schema file, we get the following exception.
> ```
> java.lang.IndexOutOfBoundsException: toIndex = 65
> at java.util.ArrayList.subListRangeCheck(ArrayList.java:962)
> at java.util.ArrayList.subList(ArrayList.java:954)
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.(RecordReaderImpl.java:202)
> at 
> org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
> at 
> 

[jira] [Commented] (SPARK-32458) Mismatched row access sizes in tests

2020-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165604#comment-17165604
 ] 

Apache Spark commented on SPARK-32458:
--

User 'mundaym' has created a pull request for this issue:
https://github.com/apache/spark/pull/29258

> Mismatched row access sizes in tests
> 
>
> Key: SPARK-32458
> URL: https://issues.apache.org/jira/browse/SPARK-32458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Michael Munday
>Priority: Minor
>  Labels: catalyst, endianness
>
> The RowEncoderSuite and UnsafeMapSuite tests fail on big-endian systems. This 
> is because the test data is written into the row using unsafe operations with 
> one size and then read back using a different size. For example, in 
> UnsafeMapSuite the test data is written using putLong and then read back 
> using getInt. This happens to work on little-endian systems but these 
> differences appear to be typos and cause the tests to fail on big-endian 
> systems.
> I have a patch that fixes the issue.






[jira] [Assigned] (SPARK-32458) Mismatched row access sizes in tests

2020-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32458:


Assignee: (was: Apache Spark)

> Mismatched row access sizes in tests
> 
>
> Key: SPARK-32458
> URL: https://issues.apache.org/jira/browse/SPARK-32458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Michael Munday
>Priority: Minor
>  Labels: catalyst, endianness
>
> The RowEncoderSuite and UnsafeMapSuite tests fail on big-endian systems. This 
> is because the test data is written into the row using unsafe operations with 
> one size and then read back using a different size. For example, in 
> UnsafeMapSuite the test data is written using putLong and then read back 
> using getInt. This happens to work on little-endian systems but these 
> differences appear to be typos and cause the tests to fail on big-endian 
> systems.
> I have a patch that fixes the issue.






[jira] [Assigned] (SPARK-32458) Mismatched row access sizes in tests

2020-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32458:


Assignee: Apache Spark

> Mismatched row access sizes in tests
> 
>
> Key: SPARK-32458
> URL: https://issues.apache.org/jira/browse/SPARK-32458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Michael Munday
>Assignee: Apache Spark
>Priority: Minor
>  Labels: catalyst, endianness
>
> The RowEncoderSuite and UnsafeMapSuite tests fail on big-endian systems. This 
> is because the test data is written into the row using unsafe operations with 
> one size and then read back using a different size. For example, in 
> UnsafeMapSuite the test data is written using putLong and then read back 
> using getInt. This happens to work on little-endian systems but these 
> differences appear to be typos and cause the tests to fail on big-endian 
> systems.
> I have a patch that fixes the issue.






[jira] [Commented] (SPARK-29918) RecordBinaryComparator should check endianness when compared by long

2020-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165614#comment-17165614
 ] 

Apache Spark commented on SPARK-29918:
--

User 'mundaym' has created a pull request for this issue:
https://github.com/apache/spark/pull/29259

> RecordBinaryComparator should check endianness when compared by long
> 
>
> Key: SPARK-29918
> URL: https://issues.apache.org/jira/browse/SPARK-29918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
>Reporter: EdisonWang
>Assignee: EdisonWang
>Priority: Minor
>  Labels: correctness
> Fix For: 2.4.5, 3.0.0
>
>
> If the architecture supports unaligned access or the offset is 8-byte aligned, 
> RecordBinaryComparator compares 8 bytes at a time by reading them as a long. 
> Otherwise, it compares byte by byte. 
> However, on a little-endian machine, the result of comparing by long values and 
> the result of comparing byte by byte may differ. If the architectures in a YARN 
> cluster are mixed (some are unaligned-access capable while others are not), then 
> the order of two records after sorting is nondeterministic, which results 
> in the same problem as in https://issues.apache.org/jira/browse/SPARK-23207
>  






[jira] [Resolved] (SPARK-32435) Remove heapq3 port from Python 3

2020-07-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32435.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29229
[https://github.com/apache/spark/pull/29229]

> Remove heapq3 port from Python 3
> 
>
> Key: SPARK-32435
> URL: https://issues.apache.org/jira/browse/SPARK-32435
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> We have the manual port heapq3 
> https://github.com/apache/spark/blob/c3be2cd347c42972d9c499b6fd9a6f988f80af12/python/pyspark/heapq3.py
>  to support {{key}} and {{reverse}}.
> It should be removed now that Python 2 support has been dropped.






[jira] [Assigned] (SPARK-32435) Remove heapq3 port from Python 3

2020-07-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32435:


Assignee: Hyukjin Kwon

> Remove heapq3 port from Python 3
> 
>
> Key: SPARK-32435
> URL: https://issues.apache.org/jira/browse/SPARK-32435
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> We have the manual port heapq3 
> https://github.com/apache/spark/blob/c3be2cd347c42972d9c499b6fd9a6f988f80af12/python/pyspark/heapq3.py
>  to support {{key}} and {{reverse}}.
> It should be removed now that Python 2 support has been dropped.






[jira] [Commented] (SPARK-32425) Spark sequence() fails if start and end of range are identical timestamps

2020-07-27 Thread JinxinTang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165634#comment-17165634
 ] 

JinxinTang commented on SPARK-32425:


cc [~viirya] I have already fixed this via 
https://issues.apache.org/jira/browse/SPARK-31980#

:)
 

> Spark sequence() fails if start and end of range are identical timestamps
> -
>
> Key: SPARK-32425
> URL: https://issues.apache.org/jira/browse/SPARK-32425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0 on Databricks Runtime 7.1
>Reporter: Lauri Koobas
>Priority: Minor
>
> The following Spark SQL query throws an exception
> {code:java}
> select sequence(current_timestamp, current_timestamp)
> {code}
> The error is:
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException: 1
>   at 
> scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:78){noformat}
>  Essentially a copy of https://issues.apache.org/jira/browse/SPARK-31980#






[jira] [Commented] (SPARK-27194) Job failures when task attempts do not clean up spark-staging parquet files

2020-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165649#comment-17165649
 ] 

Apache Spark commented on SPARK-27194:
--

User 'WinkerDu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29260

> Job failures when task attempts do not clean up spark-staging parquet files
> ---
>
> Key: SPARK-27194
> URL: https://issues.apache.org/jira/browse/SPARK-27194
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.1, 2.3.2, 2.3.3
>Reporter: Reza Safi
>Priority: Major
>
> When a container fails for some reason (for example when killed by yarn for 
> exceeding memory limits), the subsequent task attempts for the tasks that 
> were running on that container all fail with a FileAlreadyExistsException. 
> The original task attempt does not seem to successfully call abortTask (or at 
> least its "best effort" delete is unsuccessful) and clean up the parquet file 
> it was writing to, so when later task attempts try to write to the same 
> spark-staging directory using the same file name, the job fails.
> Here is what transpires in the logs:
> The container where task 200.0 is running is killed and the task is lost:
> {code}
> 19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on 
> t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 
> GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
>  19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage 
> 0.0 (TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 
> exited caused by one of the running tasks) Reason: Container killed by YARN 
> for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider 
> boosting spark.yarn.executor.memoryOverhead.
> {code}
> The task is re-attempted on a different executor and fails because the 
> part-00200-blah-blah.c000.snappy.parquet file from the first task attempt 
> already exists:
> {code}
> 19/02/20 09:35:01 WARN scheduler.TaskSetManager: Lost task 200.1 in stage 0.0 
> (TID 594, tn.y.z.com, executor 70): org.apache.spark.SparkException: Task 
> failed while writing rows.
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:109)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
>  Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet
>  for client a.b.c.d already exists
> {code}
> The job fails when the configured number of task attempts (spark.task.maxFailures) 
> have failed with the same error:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 200 
> in stage 0.0 failed 20 times, most recent failure: Lost task 284.19 in stage 
> 0.0 (TID yyy, tm.y.z.com, executor 16): org.apache.spark.SparkException: Task 
> failed while writing rows.
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
>  ...
>  Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 
> /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet
>  for client i.p.a.d already exists
> {code}
> SPARK-26682 wasn't the root cause here, since there wasn't any stage 
> reattempt.
> This issue seems to happen when 
> spark.sql.sources.partitionOverwriteMode=dynamic. 
>  






[jira] [Commented] (SPARK-29302) dynamic partition overwrite with speculation enabled

2020-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165651#comment-17165651
 ] 

Apache Spark commented on SPARK-29302:
--

User 'WinkerDu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29260

> dynamic partition overwrite with speculation enabled
> 
>
> Key: SPARK-29302
> URL: https://issues.apache.org/jira/browse/SPARK-29302
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: feiwang
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> Currently, for a dynamic partition overwrite operation, the filename of a task's 
> output is deterministic.
> So, if speculation is enabled, could a task conflict with its corresponding 
> speculative task? Would the two tasks write to the same file concurrently?






[jira] [Commented] (SPARK-20680) Spark-sql do not support for void column datatype of view

2020-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165669#comment-17165669
 ] 

Apache Spark commented on SPARK-20680:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/29244

> Spark-sql do not support for void column datatype of view
> -
>
> Key: SPARK-20680
> URL: https://issues.apache.org/jira/browse/SPARK-20680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.4.6, 3.0.0
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Major
> Fix For: 3.1.0
>
>
> Create a HIVE view:
> {quote}
> hive> create table bad as select 1 x, null z from dual;
> {quote}
> Because there's no type, Hive gives it the VOID type:
> {quote}
> hive> describe bad;
> OK
> x int 
> z void
> {quote}
> In Spark 2.0.x, reading this view behaves normally:
> {quote}
> spark-sql> describe bad;
> x   int NULL
> z   voidNULL
> Time taken: 4.431 seconds, Fetched 2 row(s)
> {quote}
> But in Spark 2.1.x, it fails with SparkException: Cannot recognize hive type 
> string: void
> {quote}
> spark-sql> describe bad;
> 17/05/09 03:12:08 INFO execution.SparkSqlParser: Parsing command: describe bad
> 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: int
> 17/05/09 03:12:08 INFO parser.CatalystSqlParser: Parsing command: void
> 17/05/09 03:12:08 ERROR thriftserver.SparkSQLDriver: Failed in [describe bad]
> org.apache.spark.SparkException: Cannot recognize hive type string: void
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
>   
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
>   
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361)
> Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
> DataType void() is not supported.(line 1, pos 0)
> == SQL ==  
> void   
> ^^^
> ... 61 more
> org.apache.spark.SparkException: Cannot recognize hive type string: void
> {quote}






[jira] [Commented] (SPARK-31851) Redesign PySpark documentation

2020-07-27 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165546#comment-17165546
 ] 

Hyukjin Kwon commented on SPARK-31851:
--

The base work is done. I will create one more PR soon to show an example of the 
documentation so that people can easily follow.

> Redesign PySpark documentation
> --
>
> Key: SPARK-31851
> URL: https://issues.apache.org/jira/browse/SPARK-31851
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, PySpark, Spark Core, SQL, Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
>
> Currently, the PySpark documentation 
> (https://spark.apache.org/docs/latest/api/python/index.html) is rather 
> poorly written compared to other projects.
> See, for example, Koalas https://koalas.readthedocs.io/en/latest/.
> PySpark is becoming more and more important in Spark, and we should improve this 
> documentation so that people can follow it easily.
> Reference: 
> - https://koalas.readthedocs.io/en/latest/






[jira] [Commented] (SPARK-32455) LogisticRegressionModel prediction optimization

2020-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165552#comment-17165552
 ] 

Apache Spark commented on SPARK-32455:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/29255

> LogisticRegressionModel prediction optimization
> ---
>
> Key: SPARK-32455
> URL: https://issues.apache.org/jira/browse/SPARK-32455
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Minor
>
> If needed, the method getThreshold and/or the following logic that computes 
> rawThreshold is called for every instance being predicted.
>  
> {code:java}
> override def getThreshold: Double = {
>   checkThresholdConsistency()
>   if (isSet(thresholds)) {
> val ts = $(thresholds)
> require(ts.length == 2, "Logistic Regression getThreshold only applies 
> to" +
>   " binary classification, but thresholds has length != 2.  thresholds: " 
> + ts.mkString(","))
> 1.0 / (1.0 + ts(0) / ts(1))
>   } else {
> $(threshold)
>   }
> } {code}
>  
> {code:java}
>   val rawThreshold = if (t == 0.0) {
> Double.NegativeInfinity
>   } else if (t == 1.0) {
> Double.PositiveInfinity
>   } else {
> math.log(t / (1.0 - t))
>   } {code}
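> A minimal sketch of the kind of optimization the title suggests (names are 
> illustrative, not the committed change): derive the raw threshold once per model 
> and reuse it, instead of re-running the logic above for every row.
> {code:java}
> // Hypothetical helper: compute rawThreshold once from the model's threshold t.
> def rawThresholdFor(t: Double): Double =
>   if (t == 0.0) Double.NegativeInfinity
>   else if (t == 1.0) Double.PositiveInfinity
>   else math.log(t / (1.0 - t))
>
> // Computed once when the model/threshold is set up...
> val cachedRawThreshold = rawThresholdFor(0.5)
>
> // ...so per-row prediction reduces to a single comparison.
> def predict(rawPrediction: Double): Double =
>   if (rawPrediction > cachedRawThreshold) 1.0 else 0.0
> {code}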






[jira] [Commented] (SPARK-32455) LogisticRegressionModel prediction optimization

2020-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165553#comment-17165553
 ] 

Apache Spark commented on SPARK-32455:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/29255

> LogisticRegressionModel prediction optimization
> ---
>
> Key: SPARK-32455
> URL: https://issues.apache.org/jira/browse/SPARK-32455
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Minor
>
> If needed, the method getThreshold and/or the following logic that computes 
> rawThreshold is called for every instance being predicted.
>  
> {code:java}
> override def getThreshold: Double = {
>   checkThresholdConsistency()
>   if (isSet(thresholds)) {
> val ts = $(thresholds)
> require(ts.length == 2, "Logistic Regression getThreshold only applies 
> to" +
>   " binary classification, but thresholds has length != 2.  thresholds: " 
> + ts.mkString(","))
> 1.0 / (1.0 + ts(0) / ts(1))
>   } else {
> $(threshold)
>   }
> } {code}
>  
> {code:java}
>   val rawThreshold = if (t == 0.0) {
> Double.NegativeInfinity
>   } else if (t == 1.0) {
> Double.PositiveInfinity
>   } else {
> math.log(t / (1.0 - t))
>   } {code}






[jira] [Assigned] (SPARK-32455) LogisticRegressionModel prediction optimization

2020-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32455:


Assignee: (was: Apache Spark)

> LogisticRegressionModel prediction optimization
> ---
>
> Key: SPARK-32455
> URL: https://issues.apache.org/jira/browse/SPARK-32455
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Minor
>
> If needed, the method getThreshold and/or the following logic that computes 
> rawThreshold is called for every instance being predicted.
>  
> {code:java}
> override def getThreshold: Double = {
>   checkThresholdConsistency()
>   if (isSet(thresholds)) {
> val ts = $(thresholds)
> require(ts.length == 2, "Logistic Regression getThreshold only applies 
> to" +
>   " binary classification, but thresholds has length != 2.  thresholds: " 
> + ts.mkString(","))
> 1.0 / (1.0 + ts(0) / ts(1))
>   } else {
> $(threshold)
>   }
> } {code}
>  
> {code:java}
>   val rawThreshold = if (t == 0.0) {
> Double.NegativeInfinity
>   } else if (t == 1.0) {
> Double.PositiveInfinity
>   } else {
> math.log(t / (1.0 - t))
>   } {code}






[jira] [Assigned] (SPARK-32455) LogisticRegressionModel prediction optimization

2020-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32455:


Assignee: Apache Spark

> LogisticRegressionModel prediction optimization
> ---
>
> Key: SPARK-32455
> URL: https://issues.apache.org/jira/browse/SPARK-32455
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Minor
>
> If needed, the method getThreshold and/or the following logic that computes 
> rawThreshold is called for every instance being predicted.
>  
> {code:java}
> override def getThreshold: Double = {
>   checkThresholdConsistency()
>   if (isSet(thresholds)) {
> val ts = $(thresholds)
> require(ts.length == 2, "Logistic Regression getThreshold only applies 
> to" +
>   " binary classification, but thresholds has length != 2.  thresholds: " 
> + ts.mkString(","))
> 1.0 / (1.0 + ts(0) / ts(1))
>   } else {
> $(threshold)
>   }
> } {code}
>  
> {code:java}
>   val rawThreshold = if (t == 0.0) {
> Double.NegativeInfinity
>   } else if (t == 1.0) {
> Double.PositiveInfinity
>   } else {
> math.log(t / (1.0 - t))
>   } {code}






[jira] [Created] (SPARK-32456) Give better error message for union streams in append mode that don't have a watermark

2020-07-27 Thread Yuanjian Li (Jira)
Yuanjian Li created SPARK-32456:
---

 Summary: Give better error message for union streams in append 
mode that don't have a watermark
 Key: SPARK-32456
 URL: https://issues.apache.org/jira/browse/SPARK-32456
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Yuanjian Li


Check the following example:

 
{code:java}
val s1 = spark.readStream.format("rate").option("rowsPerSecond", 
1).load().createOrReplaceTempView("s1")
val s2 = spark.readStream.format("rate").option("rowsPerSecond", 
1).load().createOrReplaceTempView("s2")
val unionRes = spark.sql("select s1.value , s1.timestamp from s1 union select 
s2.value, s2.timestamp from s2")
unionRes.writeStream.option("checkpointLocation", ${pathA}).start(${pathB})
{code}
 

We'll get the following confusing exception:
{code:java}
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:529)
at scala.None$.get(Option.scala:527)
at 
org.apache.spark.sql.execution.streaming.StateStoreSaveExec.$anonfun$doExecute$9(statefulOperators.scala:346)
at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:561)
at 
org.apache.spark.sql.execution.streaming.StateStoreWriter.timeTakenMs(statefulOperators.scala:112)
...
{code}
The UNION clause in SQL implies deduplication, so the parser generates 
{{Distinct(Union)}} and the optimizer rule 
{{ReplaceDistinctWithAggregate}} rewrites it to {{Aggregate(Union)}}. So the 
root cause here is that the checking logic that exists for Aggregate is missing 
for Distinct.
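As a small illustration of the mechanism (a workaround-style sketch, not the 
proposed fix): UNION ALL does not introduce the implicit Distinct, so no stateful 
aggregate is planned and the missing-watermark path is not hit.
{code:java}
val unionAllRes = spark.sql(
  "select s1.value, s1.timestamp from s1 union all select s2.value, s2.timestamp from s2")
// unionAllRes.writeStream.option("checkpointLocation", ...).start(...) proceeds
// without the None.get failure, because no Aggregate(Union) is planned.
{code}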






[jira] [Commented] (SPARK-32417) Flaky test: BlockManagerDecommissionIntegrationSuite.verify that an already running task which is going to cache data succeeds on a decommissioned executor

2020-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165466#comment-17165466
 ] 

Apache Spark commented on SPARK-32417:
--

User 'agrawaldevesh' has created a pull request for this issue:
https://github.com/apache/spark/pull/29226

> Flaky test: BlockManagerDecommissionIntegrationSuite.verify that an already 
> running task which is going to cache data succeeds on a decommissioned 
> executor
> ---
>
> Key: SPARK-32417
> URL: https://issues.apache.org/jira/browse/SPARK-32417
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126424/testReport/
> {code:java}
> Error Message
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 2759 times over 30.001772248 
> seconds. Last failure message: Map() was empty We should have a block that 
> has been on multiple BMs in rdds:  
> ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 
> localhost, 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 
> replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 
> 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0))) 
> from: 
> ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, 
> amp-jenkins-worker-05.amp, 45854, 
> None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(2, localhost, 
> 42805, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 
> 42041, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 
> 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0)))  
> but instead we got:  Map(rdd_1_0 -> 1, rdd_1_2 -> 1, rdd_1_1 -> 1).
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 2759 times over 30.001772248 
> seconds. Last failure message: Map() was empty We should have a block that 
> has been on multiple BMs in rdds:
>  ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 
> localhost, 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 
> replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 
> 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0))) 
> from:
> ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, 
> amp-jenkins-worker-05.amp, 45854, 
> None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(2, localhost, 
> 42805, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 
> 42041, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 
> 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0)))
>  but instead we got:
>  Map(rdd_1_0 -> 1, rdd_1_2 -> 1, 

[jira] [Assigned] (SPARK-32417) Flaky test: BlockManagerDecommissionIntegrationSuite.verify that an already running task which is going to cache data succeeds on a decommissioned executor

2020-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32417:


Assignee: Apache Spark

> Flaky test: BlockManagerDecommissionIntegrationSuite.verify that an already 
> running task which is going to cache data succeeds on a decommissioned 
> executor
> ---
>
> Key: SPARK-32417
> URL: https://issues.apache.org/jira/browse/SPARK-32417
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Assignee: Apache Spark
>Priority: Major
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126424/testReport/
> {code:java}
> Error Message
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 2759 times over 30.001772248 
> seconds. Last failure message: Map() was empty We should have a block that 
> has been on multiple BMs in rdds:  
> ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 
> localhost, 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 
> replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 
> 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0))) 
> from: 
> ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, 
> amp-jenkins-worker-05.amp, 45854, 
> None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(2, localhost, 
> 42805, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 
> 42041, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 
> 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0)))  
> but instead we got:  Map(rdd_1_0 -> 1, rdd_1_2 -> 1, rdd_1_1 -> 1).
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 2759 times over 30.001772248 
> seconds. Last failure message: Map() was empty We should have a block that 
> has been on multiple BMs in rdds:
>  ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 
> localhost, 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 
> replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 
> 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0))) 
> from:
> ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, 
> amp-jenkins-worker-05.amp, 45854, 
> None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(2, localhost, 
> 42805, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 
> 42041, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 
> 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0)))
>  but instead we got:
>  Map(rdd_1_0 -> 1, rdd_1_2 -> 1, rdd_1_1 -> 1).
>   at 
> 

[jira] [Issue Comment Deleted] (SPARK-24194) HadoopFsRelation cannot overwrite a path that is also being read from

2020-07-27 Thread philipse (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

philipse updated SPARK-24194:
-
Comment: was deleted

(was: Hi 

is the issue closed? Can I try it in a production environment?

 

Thanks)

> HadoopFsRelation cannot overwrite a path that is also being read from
> -
>
> Key: SPARK-24194
> URL: https://issues.apache.org/jira/browse/SPARK-24194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
> Environment: spark master
>Reporter: yangz
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When
> {code:java}
> INSERT OVERWRITE TABLE territory_count_compare select * from 
> territory_count_compare where shop_count!=real_shop_count
> {code}
> and territory_count_compare is a Parquet table, there will be an error: 
> Cannot overwrite a path that is also being read from.
>  
> And the file MetastoreDataSourceSuite.scala has a test case:
>  
>  
> {code:java}
> table(tableName).write.mode(SaveMode.Overwrite).insertInto(tableName)
> {code}
>  
> But when the table territory_count_compare is a plain Hive table, there is 
> no error. 
> So I think the reason is that when inserting overwrite into a HadoopFsRelation 
> with a static partition, the output partition is deleted first, whereas it 
> should only be deleted when the job is committed.






[jira] [Commented] (SPARK-32434) Support Scala 2.13 in AbstractCommandBuilder and load-spark-env scripts

2020-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165535#comment-17165535
 ] 

Apache Spark commented on SPARK-32434:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29254

> Support Scala 2.13 in AbstractCommandBuilder and load-spark-env scripts
> ---
>
> Key: SPARK-32434
> URL: https://issues.apache.org/jira/browse/SPARK-32434
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.1.0
>
>







[jira] [Commented] (SPARK-32434) Support Scala 2.13 in AbstractCommandBuilder and load-spark-env scripts

2020-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165536#comment-17165536
 ] 

Apache Spark commented on SPARK-32434:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29254

> Support Scala 2.13 in AbstractCommandBuilder and load-spark-env scripts
> ---
>
> Key: SPARK-32434
> URL: https://issues.apache.org/jira/browse/SPARK-32434
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.1.0
>
>







[jira] [Comment Edited] (SPARK-32453) Remove SPARK_SCALA_VERSION environment and let load-spark-env scripts detect it in AppVeyor

2020-07-27 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165549#comment-17165549
 ] 

Hyukjin Kwon edited comment on SPARK-32453 at 7/27/20, 8:57 AM:


It will be superseded by https://github.com/apache/spark/pull/29254.


was (Author: hyukjin.kwon):
It will be fixed together at https://github.com/apache/spark/pull/29254.

> Remove SPARK_SCALA_VERSION environment and let load-spark-env scripts detect 
> it in AppVeyor
> ---
>
> Key: SPARK-32453
> URL: https://issues.apache.org/jira/browse/SPARK-32453
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> .\bin\spark-submit2.cmd --driver-java-options 
> "-Dlog4j.configuration=file:///%CD:\=/%/R/log4j.properties" --conf 
> spark.hadoop.fs.defaultFS="file:///" R\pkg\tests\run-all.R
> "Presence of build for multiple Scala versions detected ( and )."
> "Remove one of them or, set SPARK_SCALA_VERSION= in \spark-env.cmd."
> "Visit  for more details about setting environment variables in 
> spark-env.cmd."
> "Either clean one of them or, set SPARK_SCALA_VERSION in spark-env.cmd."
> Command exited with code 1
> {code}
> The load-spark-env script fails as shown above in AppVeyor. SPARK_SCALA_VERSION 
> was temporarily set explicitly as a workaround, but we should remove that and 
> have the script detect the Scala version automatically.
> Possibly related to SPARK-26132, SPARK-32227 and SPARK-32434



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32455) LogisticRegressionModel prediction optimization

2020-07-27 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-32455:


 Summary: LogisticRegressionModel prediction optimization
 Key: SPARK-32455
 URL: https://issues.apache.org/jira/browse/SPARK-32455
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.1.0
Reporter: zhengruifeng


If needed, the getThreshold method and/or the following logic to compute the 
rawThreshold is currently invoked once per instance (once per row).

 
{code:java}
override def getThreshold: Double = {
  checkThresholdConsistency()
  if (isSet(thresholds)) {
val ts = $(thresholds)
require(ts.length == 2, "Logistic Regression getThreshold only applies to" +
  " binary classification, but thresholds has length != 2.  thresholds: " + 
ts.mkString(","))
1.0 / (1.0 + ts(0) / ts(1))
  } else {
$(threshold)
  }
} {code}
 
{code:java}
  val rawThreshold = if (t == 0.0) {
Double.NegativeInfinity
  } else if (t == 1.0) {
Double.PositiveInfinity
  } else {
math.log(t / (1.0 - t))
  } {code}
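
A minimal, self-contained sketch of the optimization direction: hoist the 
threshold-to-rawThreshold conversion out of the per-row path and compute it once 
per model (the names and numbers below are illustrative, not the actual model 
code; a real change would also need to invalidate the cached value whenever the 
threshold param is updated).

{code:java}
object ThresholdHoistingSketch {
  // Same conversion as above: map a probability threshold to a raw (log-odds) threshold.
  def rawThreshold(t: Double): Double =
    if (t == 0.0) Double.NegativeInfinity
    else if (t == 1.0) Double.PositiveInfinity
    else math.log(t / (1.0 - t))

  def main(args: Array[String]): Unit = {
    val threshold = 0.7
    val rawScores = Array(0.2, 1.5, -0.3, 2.0)

    // Current shape: the conversion runs once per instance.
    val perRow = rawScores.map(s => if (s > rawThreshold(threshold)) 1.0 else 0.0)

    // Proposed shape: compute it once and reuse it for every instance.
    val cached = rawThreshold(threshold)
    val hoisted = rawScores.map(s => if (s > cached) 1.0 else 0.0)

    assert(perRow.sameElements(hoisted))
  }
}
{code}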



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32453) Remove SPARK_SCALA_VERSION environment and let load-spark-env scripts detect it in AppVeyor

2020-07-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32453.
--
Resolution: Invalid

> Remove SPARK_SCALA_VERSION environment and let load-spark-env scripts detect 
> it in AppVeyor
> ---
>
> Key: SPARK-32453
> URL: https://issues.apache.org/jira/browse/SPARK-32453
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> .\bin\spark-submit2.cmd --driver-java-options 
> "-Dlog4j.configuration=file:///%CD:\=/%/R/log4j.properties" --conf 
> spark.hadoop.fs.defaultFS="file:///" R\pkg\tests\run-all.R
> "Presence of build for multiple Scala versions detected ( and )."
> "Remove one of them or, set SPARK_SCALA_VERSION= in \spark-env.cmd."
> "Visit  for more details about setting environment variables in 
> spark-env.cmd."
> "Either clean one of them or, set SPARK_SCALA_VERSION in spark-env.cmd."
> Command exited with code 1
> {code}
> The load-spark-env script fails as shown above in AppVeyor. SPARK_SCALA_VERSION 
> was temporarily set explicitly as a workaround, but we should remove that and 
> have the script detect the Scala version automatically.
> Possibly related to SPARK-26132, SPARK-32227 and SPARK-32434



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32453) Remove SPARK_SCALA_VERSION environment and let load-spark-env scripts detect it in AppVeyor

2020-07-27 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165549#comment-17165549
 ] 

Hyukjin Kwon commented on SPARK-32453:
--

It will be fixed together at https://github.com/apache/spark/pull/29254.

> Remove SPARK_SCALA_VERSION environment and let load-spark-env scripts detect 
> it in AppVeyor
> ---
>
> Key: SPARK-32453
> URL: https://issues.apache.org/jira/browse/SPARK-32453
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> {code}
> .\bin\spark-submit2.cmd --driver-java-options 
> "-Dlog4j.configuration=file:///%CD:\=/%/R/log4j.properties" --conf 
> spark.hadoop.fs.defaultFS="file:///" R\pkg\tests\run-all.R
> "Presence of build for multiple Scala versions detected ( and )."
> "Remove one of them or, set SPARK_SCALA_VERSION= in \spark-env.cmd."
> "Visit  for more details about setting environment variables in 
> spark-env.cmd."
> "Either clean one of them or, set SPARK_SCALA_VERSION in spark-env.cmd."
> Command exited with code 1
> {code}
> The load-spark-env script fails as shown above in AppVeyor. SPARK_SCALA_VERSION 
> was temporarily set explicitly as a workaround, but we should remove that and 
> have the script detect the Scala version automatically.
> Possibly related to SPARK-26132, SPARK-32227 and SPARK-32434



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32456) Give better error message for union streams in append mode that don't have a watermark

2020-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32456:


Assignee: Apache Spark

> Give better error message for union streams in append mode that don't have a 
> watermark
> --
>
> Key: SPARK-32456
> URL: https://issues.apache.org/jira/browse/SPARK-32456
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Apache Spark
>Priority: Major
>
> Check the following example:
>  
> {code:java}
> val s1 = spark.readStream.format("rate").option("rowsPerSecond", 
> 1).load().createOrReplaceTempView("s1")
> val s2 = spark.readStream.format("rate").option("rowsPerSecond", 
> 1).load().createOrReplaceTempView("s2")
> val unionRes = spark.sql("select s1.value , s1.timestamp from s1 union select 
> s2.value, s2.timestamp from s2")
> unionRes.writeStream.option("checkpointLocation", ${pathA}).start(${pathB})
> {code}
>  
> We'll get the following confusing exception:
> {code:java}
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:529)
>   at scala.None$.get(Option.scala:527)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec.$anonfun$doExecute$9(statefulOperators.scala:346)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:561)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreWriter.timeTakenMs(statefulOperators.scala:112)
> ...
> {code}
> The UNION clause in SQL requires deduplication, so the parser generates 
> {{Distinct(Union)}} and the optimizer rule {{ReplaceDistinctWithAggregate}} 
> rewrites it to {{Aggregate(Union)}}. The root cause here is that the checking 
> logic applied to Aggregate is missing for Distinct.
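> One possible rewrite of the example above that keeps the deduplication but 
> satisfies the append-mode requirement (only a sketch, assuming event-time 
> deduplication within a watermark delay is acceptable; paths are illustrative):
> {code:java}
> val s1 = spark.readStream.format("rate").option("rowsPerSecond", 1).load()
> val s2 = spark.readStream.format("rate").option("rowsPerSecond", 1).load()
> 
> val deduped = s1.select("value", "timestamp")
>   .union(s2.select("value", "timestamp"))     // UNION ALL semantics
>   .withWatermark("timestamp", "10 minutes")   // bounds the deduplication state
>   .dropDuplicates("value", "timestamp")       // the dedup that SQL UNION implied
> 
> deduped.writeStream
>   .format("parquet")
>   .option("checkpointLocation", "/tmp/union-checkpoint")
>   .start("/tmp/union-output")
> {code}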



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32456) Give better error message for union streams in append mode that don't have a watermark

2020-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32456:


Assignee: (was: Apache Spark)

> Give better error message for union streams in append mode that don't have a 
> watermark
> --
>
> Key: SPARK-32456
> URL: https://issues.apache.org/jira/browse/SPARK-32456
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> Check the following example:
>  
> {code:java}
> val s1 = spark.readStream.format("rate").option("rowsPerSecond", 
> 1).load().createOrReplaceTempView("s1")
> val s2 = spark.readStream.format("rate").option("rowsPerSecond", 
> 1).load().createOrReplaceTempView("s2")
> val unionRes = spark.sql("select s1.value , s1.timestamp from s1 union select 
> s2.value, s2.timestamp from s2")
> unionRes.writeStream.option("checkpointLocation", ${pathA}).start(${pathB})
> {code}
>  
> We'll get the following confusing exception:
> {code:java}
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:529)
>   at scala.None$.get(Option.scala:527)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec.$anonfun$doExecute$9(statefulOperators.scala:346)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:561)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreWriter.timeTakenMs(statefulOperators.scala:112)
> ...
> {code}
> The UNION clause in SQL requires deduplication, so the parser generates 
> {{Distinct(Union)}} and the optimizer rule {{ReplaceDistinctWithAggregate}} 
> rewrites it to {{Aggregate(Union)}}. The root cause here is that the checking 
> logic applied to Aggregate is missing for Distinct.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32456) Give better error message for union streams in append mode that don't have a watermark

2020-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165567#comment-17165567
 ] 

Apache Spark commented on SPARK-32456:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/29256

> Give better error message for union streams in append mode that don't have a 
> watermark
> --
>
> Key: SPARK-32456
> URL: https://issues.apache.org/jira/browse/SPARK-32456
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> Check the following example:
>  
> {code:java}
> val s1 = spark.readStream.format("rate").option("rowsPerSecond", 
> 1).load().createOrReplaceTempView("s1")
> val s2 = spark.readStream.format("rate").option("rowsPerSecond", 
> 1).load().createOrReplaceTempView("s2")
> val unionRes = spark.sql("select s1.value , s1.timestamp from s1 union select 
> s2.value, s2.timestamp from s2")
> unionRes.writeStream.option("checkpointLocation", ${pathA}).start(${pathB})
> {code}
>  
> We'll get the following confusing exception:
> {code:java}
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:529)
>   at scala.None$.get(Option.scala:527)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec.$anonfun$doExecute$9(statefulOperators.scala:346)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:561)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreWriter.timeTakenMs(statefulOperators.scala:112)
> ...
> {code}
> The UNION clause in SQL requires deduplication, so the parser generates 
> {{Distinct(Union)}} and the optimizer rule {{ReplaceDistinctWithAggregate}} 
> rewrites it to {{Aggregate(Union)}}. The root cause here is that the checking 
> logic applied to Aggregate is missing for Distinct.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32457) logParam thresholds in DT/GBT/FM/LR/MLP

2020-07-27 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-32457:


 Summary: logParam thresholds in  DT/GBT/FM/LR/MLP
 Key: SPARK-32457
 URL: https://issues.apache.org/jira/browse/SPARK-32457
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.1.0
Reporter: zhengruifeng


The param thresholds is logged in NB/RF, but not in the other ProbabilisticClassifier implementations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32417) Flaky test: BlockManagerDecommissionIntegrationSuite.verify that an already running task which is going to cache data succeeds on a decommissioned executor

2020-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32417:


Assignee: (was: Apache Spark)

> Flaky test: BlockManagerDecommissionIntegrationSuite.verify that an already 
> running task which is going to cache data succeeds on a decommissioned 
> executor
> ---
>
> Key: SPARK-32417
> URL: https://issues.apache.org/jira/browse/SPARK-32417
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126424/testReport/
> {code:java}
> Error Message
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 2759 times over 30.001772248 
> seconds. Last failure message: Map() was empty We should have a block that 
> has been on multiple BMs in rdds:  
> ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 
> localhost, 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 
> replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 
> 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0))) 
> from: 
> ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, 
> amp-jenkins-worker-05.amp, 45854, 
> None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(2, localhost, 
> 42805, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 
> 42041, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 
> 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0)))  
> but instead we got:  Map(rdd_1_0 -> 1, rdd_1_2 -> 1, rdd_1_1 -> 1).
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 2759 times over 30.001772248 
> seconds. Last failure message: Map() was empty We should have a block that 
> has been on multiple BMs in rdds:
>  ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 
> localhost, 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 
> replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 
> 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0))) 
> from:
> ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, 
> amp-jenkins-worker-05.amp, 45854, 
> None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(2, localhost, 
> 42805, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 
> 42041, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 
> 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0)))
>  but instead we got:
>  Map(rdd_1_0 -> 1, rdd_1_2 -> 1, rdd_1_1 -> 1).
>   at 
> org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:432)
>   

[jira] [Commented] (SPARK-31587) R installation in Github Actions is being failed

2020-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165511#comment-17165511
 ] 

Apache Spark commented on SPARK-31587:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/28380

> R installation in Github Actions is being failed
> 
>
> Key: SPARK-31587
> URL: https://issues.apache.org/jira/browse/SPARK-31587
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra, R
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Currently, the R installation seems to be failing as below:
> {code}
> Get:61 https://dl.bintray.com/sbt/debian  Packages [4174 B]
> Get:62 http://ppa.launchpad.net/apt-fast/stable/ubuntu bionic/main amd64 
> Packages [532 B]
> Get:63 http://ppa.launchpad.net/git-core/ppa/ubuntu bionic/main amd64 
> Packages [3036 B]
> Get:64 http://ppa.launchpad.net/ondrej/php/ubuntu bionic/main amd64 Packages 
> [52.0 kB]
> Get:65 http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu bionic/main 
> amd64 Packages [33.9 kB]
> Get:66 http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu bionic/main 
> Translation-en [10.1 kB]
> Reading package lists...
> E: The repository 'https://cloud.r-project.org/bin/linux/ubuntu 
> bionic-cran35/ Release' does not have a Release file.
> ##[error]Process completed with exit code 100.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29664) Column.getItem behavior is not consistent with Scala version

2020-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165510#comment-17165510
 ] 

Apache Spark commented on SPARK-29664:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/28327

> Column.getItem behavior is not consistent with Scala version
> 
>
> Key: SPARK-29664
> URL: https://issues.apache.org/jira/browse/SPARK-29664
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.0.0
>
>
> In PySpark, Column.getItem's behavior is different from the Scala version.
> For example,
> In PySpark:
> {code:python}
> df = spark.range(2)
> map_col = create_map(lit(0), lit(100), lit(1), lit(200))
> df.withColumn("mapped", map_col.getItem(col('id'))).show()
> # +---+--+
> # | id|mapped|
> # +---+--+
> # |  0|   100|
> # |  1|   200|
> # +---+--+
> {code}
> In Scala:
> {code:scala}
> val df = spark.range(2)
> val map_col = map(lit(0), lit(100), lit(1), lit(200))
> // The following getItem results in the following exception, which is the 
> right behavior:
> // java.lang.RuntimeException: Unsupported literal type class 
> org.apache.spark.sql.Column id
> //  at 
> org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
> //  at org.apache.spark.sql.Column.getItem(Column.scala:856)
> //  ... 49 elided
> df.withColumn("mapped", map_col.getItem(col("id"))).show
> // You have to use apply() to match with PySpark's behavior.
> df.withColumn("mapped", map_col(col("id"))).show
> // +---+--+
> // | id|mapped|
> // +---+--+
> // |  0|   100|
> // |  1|   200|
> // +---+--+
> {code}
> Looking at the code of the Scala implementation, PySpark's behavior is 
> incorrect since the argument to getItem becomes a `Literal`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32188) API Reference

2020-07-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32188.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29188
[https://github.com/apache/spark/pull/29188]

> API Reference
> -
>
> Key: SPARK-32188
> URL: https://issues.apache.org/jira/browse/SPARK-32188
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> Example: https://hyukjin-spark.readthedocs.io/en/latest/reference/index.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32179) Replace and redesign the documentation base

2020-07-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32179.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29188
[https://github.com/apache/spark/pull/29188]

> Replace and redesign the documentation base
> ---
>
> Key: SPARK-32179
> URL: https://issues.apache.org/jira/browse/SPARK-32179
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> I have a demo site for this task. See 
> https://hyukjin-spark.readthedocs.io/en/latest/ as an example.
> The base work should be done first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32457) logParam thresholds in DT/GBT/FM/LR/MLP

2020-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32457:


Assignee: (was: Apache Spark)

> logParam thresholds in  DT/GBT/FM/LR/MLP
> 
>
> Key: SPARK-32457
> URL: https://issues.apache.org/jira/browse/SPARK-32457
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Trivial
>
> The param thresholds is logged in NB/RF, but not in the other ProbabilisticClassifier implementations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32457) logParam thresholds in DT/GBT/FM/LR/MLP

2020-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165573#comment-17165573
 ] 

Apache Spark commented on SPARK-32457:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/29257

> logParam thresholds in  DT/GBT/FM/LR/MLP
> 
>
> Key: SPARK-32457
> URL: https://issues.apache.org/jira/browse/SPARK-32457
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Trivial
>
> The param thresholds is logged in NB/RF, but not in the other ProbabilisticClassifier implementations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32457) logParam thresholds in DT/GBT/FM/LR/MLP

2020-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165572#comment-17165572
 ] 

Apache Spark commented on SPARK-32457:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/29257

> logParam thresholds in  DT/GBT/FM/LR/MLP
> 
>
> Key: SPARK-32457
> URL: https://issues.apache.org/jira/browse/SPARK-32457
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Trivial
>
> The param thresholds is logged in NB/RF, but not in the other ProbabilisticClassifier implementations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32425) Spark sequence() fails if start and end of range are identical timestamps

2020-07-27 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-32425.
-
Resolution: Duplicate

> Spark sequence() fails if start and end of range are identical timestamps
> -
>
> Key: SPARK-32425
> URL: https://issues.apache.org/jira/browse/SPARK-32425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0 on Databricks Runtime 7.1
>Reporter: Lauri Koobas
>Priority: Minor
>
> The following Spark SQL query throws an exception
> {code:java}
> select sequence(current_timestamp, current_timestamp)
> {code}
> The error is:
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException: 1
>   at 
> scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:78){noformat}
>  Essentially a copy of https://issues.apache.org/jira/browse/SPARK-31980#



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32459) UDF regression of WrappedArray supporting caused by SPARK-31826

2020-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165698#comment-17165698
 ] 

Apache Spark commented on SPARK-32459:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/29261

> UDF regression of WrappedArray supporting caused by SPARK-31826
> ---
>
> Key: SPARK-32459
> URL: https://issues.apache.org/jira/browse/SPARK-32459
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
>  
> {code:java}
> test("WrappedArray") {
>   val myUdf = udf((a: WrappedArray[Int]) =>
> WrappedArray.make[Int](Array(a.head + 99)))
>   checkAnswer(Seq(Array(1))
> .toDF("col")
> .select(myUdf(Column("col"))),
> Row(ArrayBuffer(100)))
> }{code}
> Executing the above test on the master branch, we hit the following error:
> {code:java}
> [info]   org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in 
> stage 0.0 (TID 0, 192.168.101.3, executor driver): 
> java.lang.RuntimeException: Error while decoding: 
> java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer cannot be 
> cast to scala.collection.mutable.WrappedArray[info]  {code}
> However, the test can be executed successfully in branch-3.0.
>  
> This is actually a regression caused by SPARK-31826. The regression appeared 
> after we changed the catalyst-to-scala converter from CatalystTypeConverters 
> to ExpressionEncoder.deserializer.
>  
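> A user-level workaround sketch on affected builds (not the fix in the pull 
> request): declare the UDF parameter as Seq[Int], which both WrappedArray and 
> ArrayBuffer satisfy:
> {code:java}
> val myUdf = udf((a: Seq[Int]) => Seq(a.head + 99))
> Seq(Array(1)).toDF("col").select(myUdf(Column("col"))).show()  // [100]
> {code}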



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32459) UDF regression of WrappedArray supporting caused by SPARK-31826

2020-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32459:


Assignee: (was: Apache Spark)

> UDF regression of WrappedArray supporting caused by SPARK-31826
> ---
>
> Key: SPARK-32459
> URL: https://issues.apache.org/jira/browse/SPARK-32459
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
>  
> {code:java}
> test("WrappedArray") {
>   val myUdf = udf((a: WrappedArray[Int]) =>
> WrappedArray.make[Int](Array(a.head + 99)))
>   checkAnswer(Seq(Array(1))
> .toDF("col")
> .select(myUdf(Column("col"))),
> Row(ArrayBuffer(100)))
> }{code}
> Executing the above test on the master branch, we hit the following error:
> {code:java}
> [info]   org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in 
> stage 0.0 (TID 0, 192.168.101.3, executor driver): 
> java.lang.RuntimeException: Error while decoding: 
> java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer cannot be 
> cast to scala.collection.mutable.WrappedArray[info]  {code}
> However, the test can be executed successfully in branch-3.0.
>  
> This is actually a regression caused by SPARK-31826. The regression appeared 
> after we changed the catalyst-to-scala converter from CatalystTypeConverters 
> to ExpressionEncoder.deserializer.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32459) UDF regression of WrappedArray supporting caused by SPARK-31826

2020-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165696#comment-17165696
 ] 

Apache Spark commented on SPARK-32459:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/29261

> UDF regression of WrappedArray supporting caused by SPARK-31826
> ---
>
> Key: SPARK-32459
> URL: https://issues.apache.org/jira/browse/SPARK-32459
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
>  
> {code:java}
> test("WrappedArray") {
>   val myUdf = udf((a: WrappedArray[Int]) =>
> WrappedArray.make[Int](Array(a.head + 99)))
>   checkAnswer(Seq(Array(1))
> .toDF("col")
> .select(myUdf(Column("col"))),
> Row(ArrayBuffer(100)))
> }{code}
> Executing the above test on the master branch, we hit the following error:
> {code:java}
> [info]   org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in 
> stage 0.0 (TID 0, 192.168.101.3, executor driver): 
> java.lang.RuntimeException: Error while decoding: 
> java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer cannot be 
> cast to scala.collection.mutable.WrappedArray[info]  {code}
> However, the test can be executed successfully in branch-3.0.
>  
> This is actually a regression caused by SPARK-31826. The regression appeared 
> after we changed the catalyst-to-scala converter from CatalystTypeConverters 
> to ExpressionEncoder.deserializer.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32459) UDF regression of WrappedArray supporting caused by SPARK-31826

2020-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32459:


Assignee: Apache Spark

> UDF regression of WrappedArray supporting caused by SPARK-31826
> ---
>
> Key: SPARK-32459
> URL: https://issues.apache.org/jira/browse/SPARK-32459
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: Apache Spark
>Priority: Major
>
>  
> {code:java}
> test("WrappedArray") {
>   val myUdf = udf((a: WrappedArray[Int]) =>
> WrappedArray.make[Int](Array(a.head + 99)))
>   checkAnswer(Seq(Array(1))
> .toDF("col")
> .select(myUdf(Column("col"))),
> Row(ArrayBuffer(100)))
> }{code}
> Executing the above test on the master branch, we hit the following error:
> {code:java}
> [info]   org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in 
> stage 0.0 (TID 0, 192.168.101.3, executor driver): 
> java.lang.RuntimeException: Error while decoding: 
> java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer cannot be 
> cast to scala.collection.mutable.WrappedArray[info]  {code}
> However, the test can be executed successfully in branch-3.0.
>  
> This is actually a regression caused by SPARK-31826. The regression appeared 
> after we changed the catalyst-to-scala converter from CatalystTypeConverters 
> to ExpressionEncoder.deserializer.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32425) Spark sequence() fails if start and end of range are identical timestamps

2020-07-27 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165810#comment-17165810
 ] 

L. C. Hsieh commented on SPARK-32425:
-

Thanks [~JinxinTang].

> Spark sequence() fails if start and end of range are identical timestamps
> -
>
> Key: SPARK-32425
> URL: https://issues.apache.org/jira/browse/SPARK-32425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0 on Databricks Runtime 7.1
>Reporter: Lauri Koobas
>Priority: Minor
>
> The following Spark SQL query throws an exception
> {code:java}
> select sequence(current_timestamp, current_timestamp)
> {code}
> The error is:
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException: 1
>   at 
> scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:78){noformat}
>  Essentially a copy of https://issues.apache.org/jira/browse/SPARK-31980#



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-32431) The .schema() API behaves incorrectly for nested schemas that have column duplicates in case-insensitive mode

2020-07-27 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-32431:
---
Comment: was deleted

(was: I cannot reproduce the issue on master, branch-3.0 and branch-2.4. I 
opened this PR [https://github.com/apache/spark/pull/29234] with tests. 
[~mswit] Could you help me to reproduce the issue, please.)

> The .schema() API behaves incorrectly for nested schemas that have column 
> duplicates in case-insensitive mode
> -
>
> Key: SPARK-32431
> URL: https://issues.apache.org/jira/browse/SPARK-32431
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Michał Świtakowski
>Priority: Major
>
> The code below throws org.apache.spark.sql.AnalysisException: Found duplicate 
> column(s) in the data schema: `camelcase`; for multiple file formats due to a 
> duplicate column in the requested schema.
> {code:java}
> import org.apache.spark.sql.types._
> spark.conf.set("spark.sql.caseSensitive", "false")
> val formats = Seq("parquet", "orc", "avro", "json")
> val caseInsensitiveSchema = new StructType().add("LowerCase", 
> LongType).add("camelcase", LongType).add("CamelCase", LongType)
> formats.map{ format =>
> val path = s"/tmp/$format"
> spark
> .range(1L)
> .selectExpr("id AS lowercase", "id + 1 AS camelCase")
> .write.mode("overwrite").format(format).save(path) 
> spark.read.schema(caseInsensitiveSchema).format(format).load(path).show
> }
> {code}
> Similar code with nested schema behaves inconsistently across file formats 
> and sometimes returns incorrect results:
> {code:java}
> import org.apache.spark.sql.types._
> spark.conf.set("spark.sql.caseSensitive", "false")
> val formats = Seq("parquet", "orc", "avro", "json")
> val caseInsensitiveSchema = new StructType().add("StructColumn", new 
> StructType().add("LowerCase", LongType).add("camelcase", 
> LongType).add("CamelCase", LongType))
> formats.map{ format =>
> val path = s"/tmp/$format"
> spark
> .range(1L)
> .selectExpr("NAMED_STRUCT('lowercase', id, 'camelCase', id + 1) AS 
> StructColumn")
> .write.mode("overwrite").format(format).save(path)
> 
> spark.read.schema(caseInsensitiveSchema).format(format).load(path).show
> }
> {code}
> The desired behavior likely should be returning an exception just like in the 
> flat schema scenario.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32420) Add handling for unique key in non-codegen hash join

2020-07-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32420:
---

Assignee: Cheng Su

> Add handling for unique key in non-codegen hash join
> 
>
> Key: SPARK-32420
> URL: https://issues.apache.org/jira/browse/SPARK-32420
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Trivial
>
> `HashRelation` has two separate code paths for unique key look up and 
> non-unique key look up E.g. in its subclass 
> `UnsafeHashedRelation`([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L144-L177]),
>  unique key look up is more efficient as it does not have extra 
> `Iterator[UnsafeRow].hasNext()/next()` overhead per row.
> `BroadcastHashJoinExec` has handled unique key vs non-unique key separately 
> in code-gen path 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala#L289-L321]).
> But the non-codegen paths for broadcast hash join and shuffled hash join do 
> not make this distinction yet, so this issue adds that support.
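> For intuition, a generic sketch of the two lookup shapes (not the actual 
> HashedRelation code): with a unique key the probe can return the single match 
> directly instead of wrapping it in an iterator.
> {code:java}
> // Non-unique keys: every probe pays for an Iterator, even when there is one match.
> def probeNonUnique(rel: Map[Long, Seq[String]], key: Long): Iterator[String] =
>   rel.get(key).map(_.iterator).getOrElse(Iterator.empty)
> 
> // Unique keys: a direct lookup, None when the key is absent.
> def probeUnique(rel: Map[Long, String], key: Long): Option[String] =
>   rel.get(key)
> {code}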



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31993) Generated code in 'concat_ws' fails to compile when splitting method is in effect

2020-07-27 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-31993:

Component/s: (was: Spark Core)
 SQL

> Generated code in 'concat_ws' fails to compile when splitting method is in 
> effect
> -
>
> Key: SPARK-31993
> URL: https://issues.apache.org/jira/browse/SPARK-31993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.6, 3.0.0, 3.1.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.1.0
>
>
> https://github.com/apache/spark/blob/a0187cd6b59a6b6bb2cadc6711bb663d4d35a844/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L88-L195
> There're three parts of generated code in concat_ws (codes, varargCounts, 
> varargBuilds) and all parts try to split method by itself, while 
> `varargCounts` and `varargBuilds` refer on the generated code in `codes`, 
> hence the overall generated code fails to compile if any of part succeeds to 
> split.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32429) Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch

2020-07-27 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165934#comment-17165934
 ] 

Thomas Graves commented on SPARK-32429:
---

So this doesn't address the task side, it addresses the executor side. The 
Worker has a discovery script that just returns an array of strings, and as 
long as the address is something CUDA_VISIBLE_DEVICES understands it's fine. I 
would expect this to be configurable so users could turn it on and off if 
needed. The worker would set it to the GPUs it was going to assign to the 
executor anyway (via passing in the resources file). Within the executor, we 
still assign a specific GPU to a task when we launch it, and the task could set 
it via the CUDA API (cudaSetDevice) if it wants to restrict itself. Setting 
CUDA_VISIBLE_DEVICES ensures that the task doesn't use a GPU that got assigned 
to another executor.

This is essentially what YARN and k8s do today when running in a docker 
container: the container can only see the number of GPUs requested, but it's 
still up to app code to set it per thread (task) if necessary.

Yeah, the python side is perhaps more confusing in that respect, but my 
assumption was to just set CUDA_VISIBLE_DEVICES to the same value as for the 
JVM process. I think that would still allow you to reuse the python processes 
for different tasks, and it leaves the same constraint on application code to 
specifically handle the per-task setting. Again, I think this is the same as a 
python process spawned inside a yarn/k8s docker container with GPU isolation 
on. In general I would expect the GPU to really only be used by either the 
python or the jvm process, but GPUs can handle both using it, with context 
switching, as long as they don't both use all the memory.

Here is a rough prototype for the java side:

[https://github.com/tgravescs/spark/commit/8f1a13d5eef82f81ef3c424a9a7b4b47903aab7b]
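
For reference, a minimal task-side sketch using the existing user-level API (not 
the prototype above), assuming the application runs with 
spark.task.resource.gpu.amount configured so tasks actually get GPU addresses 
assigned:

{code:java}
import org.apache.spark.TaskContext

// Inside each task, read the GPU addresses the scheduler assigned to it; which
// device the task then binds to (e.g. via cudaSetDevice) is still up to app code.
val assignedGpus = sc.parallelize(1 to 2, 2).mapPartitions { _ =>
  val addresses = TaskContext.get().resources().get("gpu")
    .map(_.addresses.mkString(","))
    .getOrElse("none")
  Iterator(addresses)
}.collect()
{code}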

 

> Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch
> -
>
> Key: SPARK-32429
> URL: https://issues.apache.org/jira/browse/SPARK-32429
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> It would be nice if standalone mode could allow users to set 
> CUDA_VISIBLE_DEVICES before launching an executor.  This has multiple 
> benefits. 
>  * kind of an isolation in that the executor can only see the GPUs set there. 
>  * If your GPU application doesn't support explicitly setting the GPU device 
> id, setting this will make any GPU look like the default (id 0) and things 
> generally just work without any explicit setting
>  * New features are being added on newer GPUs that require explicit setting 
> of CUDA_VISIBLE_DEVICES like MIG 
> ([https://www.nvidia.com/en-us/technologies/multi-instance-gpu/])
> The code changes to just set this are very small; once we set it, we would 
> possibly also need to remap the GPU addresses, since CUDA_VISIBLE_DEVICES 
> renumbers them to start from device id 0 again.
> The easiest implementation would specifically support just this, put it behind 
> a config, and set it when the config is on and GPU resources are allocated.
> Note we probably want to set the same thing when we launch a python process as 
> well so that it gets the same env.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32424) Fix silent data change for timestamp parsing if overflow happens

2020-07-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32424:
---

Assignee: Kent Yao

> Fix silent data change for timestamp parsing if overflow happens
> 
>
> Key: SPARK-32424
> URL: https://issues.apache.org/jira/browse/SPARK-32424
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> The overflow behavior of timestamp parsing in 3.0 has changed compared to 2.4.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32424) Fix silent data change for timestamp parsing if overflow happens

2020-07-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32424.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29220
[https://github.com/apache/spark/pull/29220]

> Fix silent data change for timestamp parsing if overflow happens
> 
>
> Key: SPARK-32424
> URL: https://issues.apache.org/jira/browse/SPARK-32424
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.1.0
>
>
> The overflow behavior of timestamp parsing in 3.0 has changed compared to 2.4.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32332) AQE doesn't adequately allow for Columnar Processing extension

2020-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165897#comment-17165897
 ] 

Apache Spark commented on SPARK-32332:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29262

> AQE doesn't adequately allow for Columnar Processing extension 
> ---
>
> Key: SPARK-32332
> URL: https://issues.apache.org/jira/browse/SPARK-32332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> In SPARK-27396 we added support for extended Columnar Processing. We did the 
> initial work we thought was sufficient, but adaptive query execution was 
> being developed at the same time.
> We have discovered that the changes made to AQE are not sufficient for users 
> to properly extend it for columnar processing, because AQE hardcodes lookups 
> for specific classes/execs.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32332) AQE doesn't adequately allow for Columnar Processing extension

2020-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165898#comment-17165898
 ] 

Apache Spark commented on SPARK-32332:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29262

> AQE doesn't adequately allow for Columnar Processing extension 
> ---
>
> Key: SPARK-32332
> URL: https://issues.apache.org/jira/browse/SPARK-32332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> In SPARK-27396 we added support for extended Columnar Processing. We did the 
> initial work we thought was sufficient, but adaptive query execution was 
> being developed at the same time.
> We have discovered that the changes made to AQE are not sufficient for users 
> to properly extend it for columnar processing, because AQE hardcodes lookups 
> for specific classes/execs.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32429) Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch

2020-07-27 Thread Xiangrui Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165903#comment-17165903
 ] 

Xiangrui Meng commented on SPARK-32429:
---

A couple of questions:

1. Which GPU resource name do we use? "spark.task.resource.gpu" does not have 
special meaning in the current implementation.
2. I think we can do this for PySpark workers if 1) gets resolved. However, for 
executors running inside the same JVM, is there a way to set 
CUDA_VISIBLE_DEVICES differently per executor thread?

> Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch
> -
>
> Key: SPARK-32429
> URL: https://issues.apache.org/jira/browse/SPARK-32429
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> It would be nice if standalone mode could allow users to set 
> CUDA_VISIBLE_DEVICES before launching an executor.  This has multiple 
> benefits. 
>  * kind of an isolation in that the executor can only see the GPUs set there. 
>  * If your GPU application doesn't support explicitly setting the GPU device 
> id, setting this will make any GPU look like the default (id 0) and things 
> generally just work without any explicit setting
>  * New features are being added on newer GPUs that require explicit setting 
> of CUDA_VISIBLE_DEVICES like MIG 
> ([https://www.nvidia.com/en-us/technologies/multi-instance-gpu/])
> The code changes to just set this are very small; once we set it, we would 
> possibly also need to remap the GPU addresses, since CUDA_VISIBLE_DEVICES 
> renumbers them to start from device id 0 again.
> The easiest implementation would specifically support just this, put it behind 
> a config, and set it when the config is on and GPU resources are allocated.
> Note we probably want to set the same thing when we launch a python process as 
> well so that it gets the same env.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32431) The .schema() API behaves incorrectly for nested schemas that have column duplicates in case-insensitive mode

2020-07-27 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-32431:
---
Description: 
The code below throws org.apache.spark.sql.AnalysisException: Found duplicate 
column(s) in the data schema: `camelcase`; for multiple file formats due to a 
duplicate column in the requested schema.
{code:java}
import org.apache.spark.sql.types._
spark.conf.set("spark.sql.caseSensitive", "false")
val formats = Seq("parquet", "orc", "avro", "json")
val caseInsensitiveSchema = new StructType().add("LowerCase", 
LongType).add("camelcase", LongType).add("CamelCase", LongType)
formats.map{ format =>
val path = s"/tmp/$format"
spark
.range(1L)
.selectExpr("id AS lowercase", "id + 1 AS camelCase")
.write.mode("overwrite").format(format).save(path) 
spark.read.schema(caseInsensitiveSchema).format(format).load(path).show
}
{code}
Similar code with nested schema behaves inconsistently across file formats and 
sometimes returns incorrect results:
{code:java}
import org.apache.spark.sql.types._
spark.conf.set("spark.sql.caseSensitive", "false")
val formats = Seq("parquet", "orc", "avro", "json")
val caseInsensitiveSchema = new StructType().add("StructColumn", new 
StructType().add("LowerCase", LongType).add("camelcase", 
LongType).add("CamelCase", LongType))
formats.map{ format =>
val path = s"/tmp/$format"
spark
.range(1L)
.selectExpr("NAMED_STRUCT('lowercase', id, 'camelCase', id + 1) AS 
StructColumn")
.write.mode("overwrite").format(format).save(path)

spark.read.schema(caseInsensitiveSchema).format(format).load(path).show
}
{code}
The desired behavior likely should be returning an exception just like in the 
flat schema scenario.

  was:
The code below throws org.apache.spark.sql.AnalysisException: Found duplicate 
column(s) in the data schema: `camelcase`; for multiple file formats due to a 
duplicate column in the requested schema.
{code:java}
import org.apache.spark.sql.types._
spark.conf.set("spark.sql.caseSensitive", "false")
val formats = Seq("parquet", "orc", "avro", "json")
val caseInsensitiveSchema = new StructType().add("LowerCase", 
LongType).add("camelcase", LongType).add("CamelCase", LongType)
formats.map{ format =>
val path = s"/tmp/$format"
spark
.range(1L)
.selectExpr("id AS lowercase", "id + 1 AS camelCase")
.write.mode("overwrite").format(format).save(path) 
spark.read.schema(caseInsensitiveSchema).format(format).load(path).show
}
{code}
Similar code with nested schema behaves inconsistently across file formats and 
sometimes returns incorrect results:
{code:java}
import org.apache.spark.sql.types._
spark.conf.set("spark.sql.caseSensitive", "false")val formats = Seq("parquet", 
"orc", "avro", "json")val caseInsensitiveSchema = new 
StructType().add("StructColumn", new StructType().add("LowerCase", 
LongType).add("camelcase", LongType).add("CamelCase", LongType))formats.map{ 
format =>
val path = s"/tmp/$format"
spark
.range(1L)
.selectExpr("NAMED_STRUCT('lowercase', id, 'camelCase', id + 1) AS 
StructColumn")
.write.mode("overwrite").format(format).save(path)

spark.read.schema(caseInsensitiveSchema).format(format).load(path).show
}
{code}
The desired behavior likely should be returning an exception just like in the 
flat schema scenario.


> The .schema() API behaves incorrectly for nested schemas that have column 
> duplicates in case-insensitive mode
> -
>
> Key: SPARK-32431
> URL: https://issues.apache.org/jira/browse/SPARK-32431
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Michał Świtakowski
>Priority: Major
>
> The code below throws org.apache.spark.sql.AnalysisException: Found duplicate 
> column(s) in the data schema: `camelcase`; for multiple file formats due to a 
> duplicate column in the requested schema.
> {code:java}
> import org.apache.spark.sql.types._
> spark.conf.set("spark.sql.caseSensitive", "false")
> val formats = Seq("parquet", "orc", "avro", "json")
> val caseInsensitiveSchema = new StructType().add("LowerCase", 
> LongType).add("camelcase", LongType).add("CamelCase", LongType)
> formats.map{ format =>
> val path = s"/tmp/$format"
> spark
> .range(1L)
> .selectExpr("id AS lowercase", "id + 1 AS camelCase")
> .write.mode("overwrite").format(format).save(path) 
> spark.read.schema(caseInsensitiveSchema).format(format).load(path).show
> }
> {code}
> Similar code with nested schema behaves inconsistently across file formats 
> and sometimes returns incorrect results:
> {code:java}
> import org.apache.spark.sql.types._
> spark.conf.set("spark.sql.caseSensitive", "false")
> val formats = 

[jira] [Created] (SPARK-32460) how spark collects non-match results after performing broadcast left outer join

2020-07-27 Thread farshad delavarpour (Jira)
farshad delavarpour created SPARK-32460:
---

 Summary: how spark collects non-match results after performing 
broadcast left outer join
 Key: SPARK-32460
 URL: https://issues.apache.org/jira/browse/SPARK-32460
 Project: Spark
  Issue Type: Question
  Components: Spark Core, SQL
Affects Versions: 2.4.0
Reporter: farshad delavarpour


Does anybody know how Spark collects non-matching results after performing a 
broadcast hash left outer join?

Suppose we have 4 nodes: 1 driver and 3 executors, and we broadcast the left 
table. After the left outer join is performed in each executor, how does Spark 
recognize which records have not been matched, and how does it collect them 
into the final result?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32443) Fix testCommandAvailable to use POSIX compatible `command -v`

2020-07-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32443:
-

Assignee: Hyukjin Kwon  (was: Dongjoon Hyun)

> Fix testCommandAvailable to use POSIX compatible `command -v`
> -
>
> Key: SPARK-32443
> URL: https://issues.apache.org/jira/browse/SPARK-32443
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, we run the command itself to check whether it exists. This is 
> dangerous and sometimes doesn't work.
> {code}
> scala> sys.process.Process("cat").run().exitValue()
> res0: Int = 0
> scala> sys.process.Process("ls").run().exitValue()
> LICENSE
> NOTICE
> bin
> doc
> lib
> man
> res1: Int = 0
> scala> sys.process.Process("rm").run().exitValue()
> usage: rm [-f | -i] [-dPRrvW] file ...
>unlink file
> res4: Int = 64
> {code}
> {code}
> scala> sys.process.Process("command -v rm").run().exitValue()
> /bin/rm
> res5: Int = 0
> {code}
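
For illustration, a minimal sketch (a hypothetical helper, not the actual Spark 
test utility) of a `command -v` based check in Scala; since `command` is a 
shell builtin, it has to be run through `sh -c`:

{code:java}
import scala.sys.process._
import scala.util.Try

// Hypothetical helper: ask the shell whether the command exists instead of
// running the command itself.
def testCommandAvailable(command: String): Boolean = {
  val exitCode = Try(Process(Seq("sh", "-c", s"command -v $command")).!)
  exitCode.isSuccess && exitCode.get == 0
}

// testCommandAvailable("rm") should be true on POSIX systems without running `rm`.
{code}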



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32443) Fix testCommandAvailable to use POSIX compatible `command -v`

2020-07-27 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32443.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29241
[https://github.com/apache/spark/pull/29241]

> Fix testCommandAvailable to use POSIX compatible `command -v`
> -
>
> Key: SPARK-32443
> URL: https://issues.apache.org/jira/browse/SPARK-32443
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, we run the command itself to check whether it exists. This is 
> dangerous and sometimes doesn't work.
> {code}
> scala> sys.process.Process("cat").run().exitValue()
> res0: Int = 0
> scala> sys.process.Process("ls").run().exitValue()
> LICENSE
> NOTICE
> bin
> doc
> lib
> man
> res1: Int = 0
> scala> sys.process.Process("rm").run().exitValue()
> usage: rm [-f | -i] [-dPRrvW] file ...
>unlink file
> res4: Int = 64
> {code}
> {code}
> scala> sys.process.Process("command -v rm").run().exitValue()
> /bin/rm
> res5: Int = 0
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32420) Add handling for unique key in non-codegen hash join

2020-07-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32420.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29216
[https://github.com/apache/spark/pull/29216]

> Add handling for unique key in non-codegen hash join
> 
>
> Key: SPARK-32420
> URL: https://issues.apache.org/jira/browse/SPARK-32420
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Trivial
> Fix For: 3.1.0
>
>
> `HashedRelation` has two separate code paths for unique key look up and 
> non-unique key look up. E.g. in its subclass 
> `UnsafeHashedRelation`([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L144-L177]),
>  unique key look up is more efficient as it does not have extra 
> `Iterator[UnsafeRow].hasNext()/next()` overhead per row.
> `BroadcastHashJoinExec` has handled unique key vs non-unique key separately 
> in code-gen path 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala#L289-L321]).
>  But the non-codegen path for broadcast hash join and shuffled hash join do 
> not separate it yet, so adding the support here.
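
For illustration, a minimal sketch in plain Scala (a hypothetical simplified 
interface, not Spark's actual `HashedRelation` API) of why the unique-key path 
is cheaper:

{code:java}
// Hypothetical relation exposing the two look-up styles described above.
trait SimpleRelation {
  def get(key: Long): Iterator[String]   // general path: an iterator per probe
  def getValue(key: Long): String        // unique-key path: single row or null
}

def probe(rel: SimpleRelation, key: Long, keyIsUnique: Boolean): Seq[String] = {
  if (keyIsUnique) {
    // One direct look-up, no Iterator.hasNext()/next() calls per matched row.
    Option(rel.getValue(key)).toSeq
  } else {
    // Non-unique keys: must drain the iterator of all matches.
    rel.get(key).toSeq
  }
}
{code}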



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32457) logParam thresholds in DT/GBT/FM/LR/MLP

2020-07-27 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao resolved SPARK-32457.

Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29257
[https://github.com/apache/spark/pull/29257]

> logParam thresholds in  DT/GBT/FM/LR/MLP
> 
>
> Key: SPARK-32457
> URL: https://issues.apache.org/jira/browse/SPARK-32457
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Trivial
> Fix For: 3.1.0
>
>
> The param thresholds is logged in NB/RF, but not in other ProbabilisticClassifiers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32457) logParam thresholds in DT/GBT/FM/LR/MLP

2020-07-27 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao reassigned SPARK-32457:
--

Assignee: zhengruifeng

> logParam thresholds in  DT/GBT/FM/LR/MLP
> 
>
> Key: SPARK-32457
> URL: https://issues.apache.org/jira/browse/SPARK-32457
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Trivial
>
> The param thresholds is logged in NB/RF, but not in other ProbabilisticClassifiers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32421) Add code-gen for shuffled hash join

2020-07-27 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-32421:
-
Parent: SPARK-32461
Issue Type: Sub-task  (was: Improvement)

> Add code-gen for shuffled hash join
> ---
>
> Key: SPARK-32421
> URL: https://issues.apache.org/jira/browse/SPARK-32421
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Trivial
>
> We added shuffled hash join codegen internally in our fork, and are seeing an 
> obvious improvement in benchmarks compared to the current non-codegen code 
> path. Creating this Jira to add this support. Shuffled hash join codegen is 
> very similar to broadcast hash join codegen, so this is a simple change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32462) Don't save the previous search text for datatable

2020-07-27 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-32462:
--

 Summary: Don't save the previous search text for datatable
 Key: SPARK-32462
 URL: https://issues.apache.org/jira/browse/SPARK-32462
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.1.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


DataTable is used in stage-page and executors-page for pagination and filter 
tasks/executors by search text.

In the current implementation, the keyword is saved so if we visit stage-page 
for a job, the previous search text is filled in the textbox and the task table 
is filtered.

I'm sometimes surprised by this behavior as the stage-page lists no tasks 
because tasks are filtered by the previous search text.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21505) A dynamic join operator to improve the join reliability

2020-07-27 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-21505:
-
Parent: SPARK-32461
Issue Type: Sub-task  (was: New Feature)

> A dynamic join operator to improve the join reliability
> ---
>
> Key: SPARK-21505
> URL: https://issues.apache.org/jira/browse/SPARK-21505
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0, 3.0.0
>Reporter: Lin
>Priority: Major
>  Labels: features
>
> As we know, hash join is more efficient than sort merge join. But today hash 
> join is not so widely used because it may fail with an OutOfMemory (OOM) 
> error due to limited memory resources, data skew, statistics mis-estimation 
> and so on. For example, if we apply shuffled hash join on an unevenly 
> distributed dataset, some partitions might be so large that we cannot build a 
> hash table for that particular partition, causing an OOM error. When OOM 
> happens, current Spark technology will throw an exception, resulting in job 
> failure. On the other hand, if sort-merge join is used, there will be 
> shuffle, sorting and extra spill, causing degradation of the join. 
> Considering the efficiency of hash join, we want to propose a fallback 
> mechanism to dynamically use hash join or sort-merge join at runtime at the 
> task level to provide a more reliable join operation.
> This new dynamic join operator internally implements the logic of HashJoin, 
> Iterator Reconstruct, Sort, and MergeJoin. We show the process of this 
> dynamic join method as follows:
> HashJoin: We start by building a hash table on one side of the join 
> partitions. If the hash table is built successfully, it is the same as the 
> current ShuffledHashJoin operator.
> Sort: If we fail to build the hash table due to the large partition size, we 
> do SortMergeJoin only on this partition, but we first need to rebuild the 
> partition. When OOM happens, a hash table corresponding to a partial part of 
> this partition has been built successfully (e.g. the first 4000 rows of the 
> RDD), and the iterator of this partition now points to the 4001st row. We 
> reuse this hash table to reconstruct the iterator for the first 4000 rows and 
> concatenate it with the remaining rows of this partition so that we can 
> rebuild the partition completely. On this rebuilt partition, we apply sorting 
> based on key values.
> MergeJoin: After getting two sorted iterators, we perform a regular merge 
> join against them and emit the records to downstream operators.
> Iterator Reconstruct: BytesToBytesMap has to be spilled to disk to release 
> the memory for other operators, such as Sort, Join, etc. In addition, it has 
> to be converted to an Iterator so that it can be concatenated with the 
> remaining items in the original iterator that was used to build the hash 
> table.
> Meta Data Population: Necessary metadata, such as sorting keys, join type, 
> etc., has to be populated so that it can be used by the potential Sort and 
> MergeJoin operators.
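
To make the flow above concrete, here is a rough, self-contained sketch in 
plain Scala (hypothetical types, not Spark's internals); it switches on a 
row-count budget instead of catching an actual OOM, purely for illustration:

{code:java}
object DynamicJoinSketch {
  // Inner join of two keyed datasets: hash join when the build side fits the
  // budget, otherwise fall back to sorting both sides and merge-joining them.
  def dynamicJoin(
      build: Seq[(Int, String)],
      stream: Seq[(Int, String)],
      maxBuildRows: Int): Seq[(Int, String, String)] = {
    if (build.size <= maxBuildRows) {
      // Hash-join path: build a hash table on the build side, probe with stream rows.
      val table = build.groupBy(_._1)
      stream.flatMap { case (k, sv) =>
        table.getOrElse(k, Nil).map { case (_, bv) => (k, sv, bv) }
      }
    } else {
      // Fallback path: sort both sides by key and do a simple merge join.
      val b = build.sortBy(_._1)
      val s = stream.sortBy(_._1)
      val out = scala.collection.mutable.ArrayBuffer.empty[(Int, String, String)]
      var i = 0
      var j = 0
      while (i < s.length && j < b.length) {
        if (s(i)._1 < b(j)._1) i += 1
        else if (s(i)._1 > b(j)._1) j += 1
        else {
          // Emit every build row sharing this key for the current stream row.
          var jj = j
          while (jj < b.length && b(jj)._1 == s(i)._1) {
            out += ((s(i)._1, s(i)._2, b(jj)._2))
            jj += 1
          }
          i += 1
        }
      }
      out.toSeq
    }
  }
}
{code}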



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32461) Shuffled hash join improvement

2020-07-27 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165951#comment-17165951
 ] 

Cheng Su commented on SPARK-32461:
--

Just FYI - I am working on each sub-tasks separately now.

> Shuffled hash join improvement
> --
>
> Key: SPARK-32461
> URL: https://issues.apache.org/jira/browse/SPARK-32461
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Minor
>
> Shuffled hash join avoids a sort compared to sort merge join. This advantage 
> shows up obviously when joining large tables, in terms of saving CPU and IO 
> (in case an external sort happens). In the latest master trunk, shuffled hash 
> join is disabled by default with the config 
> "spark.sql.join.preferSortMergeJoin"=true, in favor of reducing the risk of 
> OOM. However, shuffled hash join could be improved to a better state 
> (validated in our internal fork). Creating this Jira to track overall 
> progress.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32417) Flaky test: BlockManagerDecommissionIntegrationSuite.verify that an already running task which is going to cache data succeeds on a decommissioned executor

2020-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165979#comment-17165979
 ] 

Apache Spark commented on SPARK-32417:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29263

> Flaky test: BlockManagerDecommissionIntegrationSuite.verify that an already 
> running task which is going to cache data succeeds on a decommissioned 
> executor
> ---
>
> Key: SPARK-32417
> URL: https://issues.apache.org/jira/browse/SPARK-32417
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126424/testReport/
> {code:java}
> Error Message
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 2759 times over 30.001772248 
> seconds. Last failure message: Map() was empty We should have a block that 
> has been on multiple BMs in rdds:  
> ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 
> localhost, 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 
> replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 
> 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0))) 
> from: 
> ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, 
> amp-jenkins-worker-05.amp, 45854, 
> None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(2, localhost, 
> 42805, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 
> 42041, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 
> 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0)))  
> but instead we got:  Map(rdd_1_0 -> 1, rdd_1_2 -> 1, rdd_1_1 -> 1).
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 2759 times over 30.001772248 
> seconds. Last failure message: Map() was empty We should have a block that 
> has been on multiple BMs in rdds:
>  ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 
> localhost, 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 
> replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 
> 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0))) 
> from:
> ArrayBuffer(SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, 
> amp-jenkins-worker-05.amp, 45854, 
> None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(2, localhost, 
> 42805, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 
> 42041, None),broadcast_1_piece0,StorageLevel(memory, 1 replicas),2695,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),rdd_1_2,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, localhost, 
> 42041, None),rdd_1_1,StorageLevel(memory, deserialized, 1 replicas),56,0)), 
> SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, localhost, 
> 37968, None),rdd_1_0,StorageLevel(memory, deserialized, 1 replicas),56,0)))
>  but instead we got:
>  Map(rdd_1_0 -> 1, rdd_1_2 -> 1, rdd_1_1 

[jira] [Updated] (SPARK-32462) Don't save the previous search text for datatable

2020-07-27 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-32462:
---
Description: 
DataTable is used in stage-page and executors-page for pagination and filter 
tasks/executors by search text.

In the current implementation, search text is saved so if we visit stage-page 
for a job, the previous search text is filled in the textbox and the task table 
is filtered.

I'm sometimes surprised by this behavior as the stage-page lists no tasks 
because tasks are filtered by the previous search text.

  was:
DataTable is used in stage-page and executors-page for pagination and filter 
tasks/executors by search text.

In the current implementation, the keyword is saved so if we visit stage-page 
for a job, the previous search text is filled in the textbox and the task table 
is filtered.

I'm sometimes surprised by this behavior as the stage-page lists no tasks 
because tasks are filtered by the previous search text.


> Don't save the previous search text for datatable
> -
>
> Key: SPARK-32462
> URL: https://issues.apache.org/jira/browse/SPARK-32462
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> DataTable is used in stage-page and executors-page for pagination and filter 
> tasks/executors by search text.
> In the current implementation, search text is saved so if we visit stage-page 
> for a job, the previous search text is filled in the textbox and the task 
> table is filtered.
> I'm sometimes surprised by this behavior as the stage-page lists no tasks 
> because tasks are filtered by the previous search text.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32383) Preserve hash join (BHJ and SHJ) stream side ordering

2020-07-27 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-32383:
-
Parent: SPARK-32461
Issue Type: Sub-task  (was: Improvement)

> Preserve hash join (BHJ and SHJ) stream side ordering
> -
>
> Key: SPARK-32383
> URL: https://issues.apache.org/jira/browse/SPARK-32383
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Trivial
> Fix For: 3.1.0
>
>
> Currently `BroadcastHashJoinExec` and `ShuffledHashJoinExec` do not preserve 
> children output ordering information (inherit from 
> `SparkPlan.outputOrdering`, which is Nil). This can add unnecessary sort in 
> complex queries involved multiple joins.
> Example:
>  
> {code:java}
> withSQLConf(
>   SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50") {
>   val df1 = spark.range(100).select($"id".as("k1"))
>   val df2 = spark.range(100).select($"id".as("k2"))
>   val df3 = spark.range(3).select($"id".as("k3"))
>   val df4 = spark.range(100).select($"id".as("k4"))
>   val plan = df1.join(df2, $"k1" === $"k2")
> .join(df3, $"k1" === $"k3")
> .join(df4, $"k1" === $"k4")
> .queryExecution
> .executedPlan
> }
> {code}
>  
> Current physical plan (extra sort on `k1` before top sort merge join):
> {code:java}
> *(9) SortMergeJoin [k1#220L], [k4#232L], Inner
> :- *(6) Sort [k1#220L ASC NULLS FIRST], false, 0
> :  +- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight
> : :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner
> : :  :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0
> : :  :  +- Exchange hashpartitioning(k1#220L, 5), true, [id=#128]
> : :  : +- *(1) Project [id#218L AS k1#220L]
> : :  :+- *(1) Range (0, 100, step=1, splits=2)
> : :  +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0
> : : +- Exchange hashpartitioning(k2#224L, 5), true, [id=#134]
> : :+- *(3) Project [id#222L AS k2#224L]
> : :   +- *(3) Range (0, 100, step=1, splits=2)
> : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
> false])), [id=#141]
> :+- *(5) Project [id#226L AS k3#228L]
> :   +- *(5) Range (0, 3, step=1, splits=2)
> +- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(k4#232L, 5), true, [id=#148]
>   +- *(7) Project [id#230L AS k4#232L]
>  +- *(7) Range (0, 100, step=1, splits=2)
> {code}
> Ideal physical plan (no extra sort on `k1` before top sort merge join):
> {code:java}
> *(9) SortMergeJoin [k1#220L], [k4#232L], Inner
> :- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight
> :  :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner
> :  :  :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0
> :  :  :  +- Exchange hashpartitioning(k1#220L, 5), true, [id=#127]
> :  :  : +- *(1) Project [id#218L AS k1#220L]
> :  :  :+- *(1) Range (0, 100, step=1, splits=2)
> :  :  +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0
> :  : +- Exchange hashpartitioning(k2#224L, 5), true, [id=#133]
> :  :+- *(3) Project [id#222L AS k2#224L]
> :  :   +- *(3) Range (0, 100, step=1, splits=2)
> :  +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
> false])), [id=#140]
> : +- *(5) Project [id#226L AS k3#228L]
> :+- *(5) Range (0, 3, step=1, splits=2)
> +- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(k4#232L, 5), true, [id=#146]
>   +- *(7) Project [id#230L AS k4#232L]
>  +- *(7) Range (0, 100, step=1, splits=2){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32420) Add handling for unique key in non-codegen hash join

2020-07-27 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-32420:
-
Parent: SPARK-32461
Issue Type: Sub-task  (was: Improvement)

> Add handling for unique key in non-codegen hash join
> 
>
> Key: SPARK-32420
> URL: https://issues.apache.org/jira/browse/SPARK-32420
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Trivial
> Fix For: 3.1.0
>
>
> `HashedRelation` has two separate code paths for unique key look up and 
> non-unique key look up. E.g. in its subclass 
> `UnsafeHashedRelation`([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L144-L177]),
>  unique key look up is more efficient as it does not have extra 
> `Iterator[UnsafeRow].hasNext()/next()` overhead per row.
> `BroadcastHashJoinExec` has handled unique key vs non-unique key separately 
> in code-gen path 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala#L289-L321]).
>  But the non-codegen path for broadcast hash join and shuffled hash join do 
> not separate it yet, so adding the support here.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32399) Support full outer join in shuffled hash join and broadcast hash join

2020-07-27 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-32399:
-
Parent: SPARK-32461
Issue Type: Sub-task  (was: Improvement)

> Support full outer join in shuffled hash join and broadcast hash join
> -
>
> Key: SPARK-32399
> URL: https://issues.apache.org/jira/browse/SPARK-32399
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Minor
>
> Currently for SQL full outer join, Spark always does a sort merge join no 
> matter how large the join children are. Inspired by the recent discussions 
> in [https://github.com/apache/spark/pull/29130#discussion_r456502678] and 
> [https://github.com/apache/spark/pull/29181], I think we can support full 
> outer join in shuffled hash join and broadcast hash join in the following 
> way: when looking up stream side keys from the build side {{HashedRelation}}, 
> mark the matched rows inside the build side {{HashedRelation}}; after reading 
> all rows from the stream side, output all non-matching rows from the build 
> side based on the modified {{HashedRelation}}. But more design details need 
> to be figured out for this JIRA.
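
A minimal, self-contained sketch of that idea (plain Scala collections, with a 
bit set standing in for the modified {{HashedRelation}}; all names are 
illustrative only):

{code:java}
import scala.collection.mutable

// Full outer hash join: stream side probes the build-side table, matched build
// rows are marked, and unmatched build rows are emitted at the end with nulls
// (modelled here as None).
def fullOuterHashJoin(
    build: Seq[(Int, String)],
    stream: Seq[(Int, String)]): Seq[(Option[String], Option[String])] = {
  val indexed = build.zipWithIndex
  val table = indexed.groupBy { case ((k, _), _) => k }
  val matched = mutable.BitSet.empty                    // the "mark this info" part

  // Stream side: usual outer-join behaviour, while recording matched build rows.
  val streamSide: Seq[(Option[String], Option[String])] = stream.flatMap { case (k, sv) =>
    table.get(k) match {
      case Some(rows) => rows.map { case ((_, bv), idx) =>
        matched += idx
        (Some(sv), Some(bv))
      }
      case None => Seq((Some(sv), None))                // stream row with no match
    }
  }
  // After the stream side is exhausted, output the never-matched build rows.
  val buildOnly = indexed.collect {
    case ((_, bv), idx) if !matched.contains(idx) => (None, Some(bv))
  }
  streamSide ++ buildOnly
}
{code}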



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28210) Shuffle Storage API: Reads

2020-07-27 Thread Tianchen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166027#comment-17166027
 ] 

Tianchen Zhang edited comment on SPARK-28210 at 7/27/20, 11:40 PM:
---

Hi [~devaraj], do you mind sharing some ideas about your change? Is it based on 
the linked PR in this JIRA? We are also finding a way to have our own storage 
implementation but are blocked by the lack of reader's API. Thanks.


was (Author: tianczha):
Hi [~devaraj], do you mind share some ideas about your change? Is it based on 
the linked PR in this JIRA? We are also finding a way to have our own storage 
implementation but are blocked by the lack of reader's API. Thanks.

> Shuffle Storage API: Reads
> --
>
> Key: SPARK-28210
> URL: https://issues.apache.org/jira/browse/SPARK-28210
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Matt Cheah
>Priority: Major
>
> As part of the effort to store shuffle data in arbitrary places, this issue 
> tracks implementing an API for reading the shuffle data stored by the write 
> API. Also ensure that the existing shuffle implementation is refactored to 
> use the API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28210) Shuffle Storage API: Reads

2020-07-27 Thread Tianchen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166027#comment-17166027
 ] 

Tianchen Zhang commented on SPARK-28210:


Hi [~devaraj], do you mind share some ideas about your change? Is it based on 
the linked PR in this JIRA? We are also finding a way to have our own storage 
implementation but are blocked by the lack of reader's API. Thanks.

> Shuffle Storage API: Reads
> --
>
> Key: SPARK-28210
> URL: https://issues.apache.org/jira/browse/SPARK-28210
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Matt Cheah
>Priority: Major
>
> As part of the effort to store shuffle data in arbitrary places, this issue 
> tracks implementing an API for reading the shuffle data stored by the write 
> API. Also ensure that the existing shuffle implementation is refactored to 
> use the API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32461) Shuffled hash join improvement

2020-07-27 Thread Cheng Su (Jira)
Cheng Su created SPARK-32461:


 Summary: Shuffled hash join improvement
 Key: SPARK-32461
 URL: https://issues.apache.org/jira/browse/SPARK-32461
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Cheng Su


Shuffled hash join avoids a sort compared to sort merge join. This advantage 
shows up obviously when joining large tables, in terms of saving CPU and IO (in 
case an external sort happens). In the latest master trunk, shuffled hash join 
is disabled by default with the config "spark.sql.join.preferSortMergeJoin"=true, 
in favor of reducing the risk of OOM. However, shuffled hash join could be 
improved to a better state (validated in our internal fork). Creating this Jira 
to track overall progress.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32330) Preserve shuffled hash join build side partitioning

2020-07-27 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-32330:
-
Parent: SPARK-32461
Issue Type: Sub-task  (was: Improvement)

> Preserve shuffled hash join build side partitioning
> ---
>
> Key: SPARK-32330
> URL: https://issues.apache.org/jira/browse/SPARK-32330
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently `ShuffledHashJoin.outputPartitioning` inherits from 
> `HashJoin.outputPartitioning`, which only preserves stream side partitioning:
> `HashJoin.scala`
> {code:java}
> override def outputPartitioning: Partitioning = 
> streamedPlan.outputPartitioning
> {code}
> This loses build side partitioning information, and causes extra shuffle if 
> there's another join / group-by after this join.
> Example:
>  
> {code:java}
> // code placeholder
> withSQLConf(
> SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50",
> SQLConf.SHUFFLE_PARTITIONS.key -> "2",
> SQLConf.PREFER_SORTMERGEJOIN.key -> "false") {
>   val df1 = spark.range(10).select($"id".as("k1"))
>   val df2 = spark.range(30).select($"id".as("k2"))
>   Seq("inner", "cross").foreach(joinType => {
> val plan = df1.join(df2, $"k1" === $"k2", joinType).groupBy($"k1").count()
>   .queryExecution.executedPlan
> assert(plan.collect { case _: ShuffledHashJoinExec => true }.size === 1)
> // No extra shuffle before aggregate
> assert(plan.collect { case _: ShuffleExchangeExec => true }.size === 2)
>   })
> }{code}
>  
> Current physical plan (having an extra shuffle on `k1` before aggregate)
>  
> {code:java}
> *(4) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, 
> count#235L])
> +- Exchange hashpartitioning(k1#220L, 2), true, [id=#117]
>+- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], 
> output=[k1#220L, count#239L])
>   +- *(3) Project [k1#220L]
>  +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft
> :- Exchange hashpartitioning(k1#220L, 2), true, [id=#109]
> :  +- *(1) Project [id#218L AS k1#220L]
> : +- *(1) Range (0, 10, step=1, splits=2)
> +- Exchange hashpartitioning(k2#224L, 2), true, [id=#111]
>+- *(2) Project [id#222L AS k2#224L]
>   +- *(2) Range (0, 30, step=1, splits=2){code}
>  
> Ideal physical plan (no shuffle on `k1` before aggregate)
> {code:java}
>  *(3) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, 
> count#235L])
> +- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], 
> output=[k1#220L, count#239L])
>+- *(3) Project [k1#220L]
>   +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft
>  :- Exchange hashpartitioning(k1#220L, 2), true, [id=#107]
>  :  +- *(1) Project [id#218L AS k1#220L]
>  : +- *(1) Range (0, 10, step=1, splits=2)
>  +- Exchange hashpartitioning(k2#224L, 2), true, [id=#109]
> +- *(2) Project [id#222L AS k2#224L]
>+- *(2) Range (0, 30, step=1, splits=2){code}
>  
> This can be fixed by overriding `outputPartitioning` method in 
> `ShuffledHashJoinExec`, similar to `SortMergeJoinExec`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32286) Coalesce bucketed tables for shuffled hash join if applicable

2020-07-27 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-32286:
-
Parent: SPARK-32461
Issue Type: Sub-task  (was: Improvement)

> Coalesce bucketed tables for shuffled hash join if applicable
> -
>
> Key: SPARK-32286
> URL: https://issues.apache.org/jira/browse/SPARK-32286
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Trivial
> Fix For: 3.1.0
>
>
> Based on a follow-up comment in PR 
> [#28123|https://github.com/apache/spark/pull/28123], we can coalesce buckets 
> for shuffled hash join as well. The note here is that we only coalesce the 
> buckets on the shuffled hash join stream side (i.e. the side not building the 
> hash map), so we don't need to worry about OOM when coalescing multiple 
> buckets in one task for building the hash map.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32462) Don't save the previous search text for datatable

2020-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32462:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Don't save the previous search text for datatable
> -
>
> Key: SPARK-32462
> URL: https://issues.apache.org/jira/browse/SPARK-32462
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> DataTable is used in stage-page and executors-page for pagination and filter 
> tasks/executors by search text.
> In the current implementation, search text is saved so if we visit stage-page 
> for a job, the previous search text is filled in the textbox and the task 
> table is filtered.
> I'm sometimes surprised by this behavior as the stage-page lists no tasks 
> because tasks are filtered by the previous search text.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32462) Don't save the previous search text for datatable

2020-07-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166025#comment-17166025
 ] 

Apache Spark commented on SPARK-32462:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/29265

> Don't save the previous search text for datatable
> -
>
> Key: SPARK-32462
> URL: https://issues.apache.org/jira/browse/SPARK-32462
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> DataTable is used in stage-page and executors-page for pagination and filter 
> tasks/executors by search text.
> In the current implementation, search text is saved so if we visit stage-page 
> for a job, the previous search text is filled in the textbox and the task 
> table is filtered.
> I'm sometimes surprised by this behavior as the stage-page lists no tasks 
> because tasks are filtered by the previous search text.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32462) Don't save the previous search text for datatable

2020-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32462:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Don't save the previous search text for datatable
> -
>
> Key: SPARK-32462
> URL: https://issues.apache.org/jira/browse/SPARK-32462
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> DataTable is used in stage-page and executors-page for pagination and filter 
> tasks/executors by search text.
> In the current implementation, search text is saved so if we visit stage-page 
> for a job, the previous search text is filled in the textbox and the task 
> table is filtered.
> I'm sometimes surprised by this behavior as the stage-page lists no tasks 
> because tasks are filtered by the previous search text.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31753) Add missing keywords in the SQL documents

2020-07-27 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-31753:
-
Affects Version/s: (was: 3.0.0)
   3.0.1

> Add missing keywords in the SQL documents
> -
>
> Key: SPARK-31753
> URL: https://issues.apache.org/jira/browse/SPARK-31753
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> Some keywords are missing in the SQL documents and a list of them is as 
> follows. 
>  [https://github.com/apache/spark/pull/28290#issuecomment-619321301]
> {code:java}
> AFTER
> CASE/ELSE
> WHEN/THEN
> IGNORE NULLS
> LATERAL VIEW (OUTER)?
> MAP KEYS TERMINATED BY
> NULL DEFINED AS
> LINES TERMINATED BY
> ESCAPED BY
> COLLECTION ITEMS TERMINATED BY
> EXPLAIN LOGICAL
> PIVOT
> {code}
> They should be documented there, too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31753) Add missing keywords in the SQL documents

2020-07-27 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-31753:
-
Affects Version/s: 3.0.0

> Add missing keywords in the SQL documents
> -
>
> Key: SPARK-31753
> URL: https://issues.apache.org/jira/browse/SPARK-31753
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> Some keywords are missing in the SQL documents and a list of them is as 
> follows. 
>  [https://github.com/apache/spark/pull/28290#issuecomment-619321301]
> {code:java}
> AFTER
> CASE/ELSE
> WHEN/THEN
> IGNORE NULLS
> LATERAL VIEW (OUTER)?
> MAP KEYS TERMINATED BY
> NULL DEFINED AS
> LINES TERMINATED BY
> ESCAPED BY
> COLLECTION ITEMS TERMINATED BY
> EXPLAIN LOGICAL
> PIVOT
> {code}
> They should be documented there, too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31753) Add missing keywords in the SQL documents

2020-07-27 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-31753.
--
Fix Version/s: 3.0.1
 Assignee: philipse
   Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/29056]

> Add missing keywords in the SQL documents
> -
>
> Key: SPARK-31753
> URL: https://issues.apache.org/jira/browse/SPARK-31753
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: philipse
>Priority: Minor
> Fix For: 3.0.1
>
>
> Some keywords are missing in the SQL documents and a list of them is as 
> follows. 
>  [https://github.com/apache/spark/pull/28290#issuecomment-619321301]
> {code:java}
> AFTER
> CASE/ELSE
> WHEN/THEN
> IGNORE NULLS
> LATERAL VIEW (OUTER)?
> MAP KEYS TERMINATED BY
> NULL DEFINED AS
> LINES TERMINATED BY
> ESCAPED BY
> COLLECTION ITEMS TERMINATED BY
> EXPLAIN LOGICAL
> PIVOT
> {code}
> They should be documented there, too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32463) Document Data Type inference rule in SQL reference

2020-07-27 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-32463:
--

 Summary: Document Data Type inference rule in SQL reference
 Key: SPARK-32463
 URL: https://issues.apache.org/jira/browse/SPARK-32463
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 3.1.0
Reporter: Huaxin Gao


Document Data Type inference rule in SQL reference, under Data Types section. 
Please see this PR https://github.com/apache/spark/pull/28896



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32464) Support skew handling on join with one side that has no query stage

2020-07-27 Thread Wang, Gang (Jira)
Wang, Gang created SPARK-32464:
--

 Summary: Support skew handling on join with one side that has no 
query stage
 Key: SPARK-32464
 URL: https://issues.apache.org/jira/browse/SPARK-32464
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wang, Gang


In our production environment, there are many bucket tables, which are used to 
join with other tables. And there are some skewed joins now and then. However, 
in the current implementation, skew join handling can only be applied when both 
sides of a SMJ are QueryStages, so it is not able to deal with such cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32439) Override datasource implementation during look up via configuration

2020-07-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32439.
--
Resolution: Won't Fix

> Override datasource implementation during look up via configuration
> ---
>
> Key: SPARK-32439
> URL: https://issues.apache.org/jira/browse/SPARK-32439
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Priority: Minor
>
> We need a mechanism to override the datasource implementation via 
> configuration.
> For example, suppose I have a custom CSV datasource implementation called 
> "my_csv". One way to use it is:
> {code}
>  val df = spark.read.format("my_csv").load(...)
> {code}
> Since the source data is the same format (CSV), you should be able to 
> override the default implementation.
> One proposal is to do the following:
> {code}
> spark.conf.set("spark.sql.datasource.override.csv", "my_csv") val df = 
> spark.read.csv(...)
> {code}
> This has a benefit that the user doesn't have to change the application code 
> to try out a new datasource implementation for the same source format.
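
For illustration, a minimal sketch of the proposed look-up behaviour (the 
config key and helper are taken from the proposal above, not an existing Spark 
API): consult an override entry first, then fall back to the requested short 
name:

{code:java}
// Hypothetical resolution step, assuming session config exposed as a Map.
def resolveDataSource(shortName: String, conf: Map[String, String]): String =
  conf.getOrElse(s"spark.sql.datasource.override.$shortName", shortName)

// resolveDataSource("csv", Map("spark.sql.datasource.override.csv" -> "my_csv"))
// returns "my_csv", so spark.read.csv(...) would be served by the custom source.
{code}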



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32361) Remove project if output is subset of child

2020-07-27 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166094#comment-17166094
 ] 

Hyukjin Kwon commented on SPARK-32361:
--

Please fill JIRA description.

> Remove project if output is subset of child
> ---
>
> Key: SPARK-32361
> URL: https://issues.apache.org/jira/browse/SPARK-32361
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32369) pyspark foreach/foreachPartition send http request failed

2020-07-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32369:
-
Description: 
I use urllib.request to send an HTTP request in foreach/foreachPartition. 
PySpark throws an error as follows:

{code}
_objc[74094]: +[__NSPlaceholderDate initialize] may have been in progress in 
another thread when fork() was called. We cannot safely call it or ignore it in 
the fork() child process. Crashing instead. Set a breakpoint on 
objc_initializeAfterForkError to debug.20/07/20 19:05:58 ERROR Executor: 
Exception in task 7.0 in stage 0.0 (TID 7)org.apache.spark.SparkException: 
Python worker exited unexpectedly (crashed)        at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:536)
         at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:525)
         at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)   
      at 
org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:643)   
      at 
org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:621)   
      at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
         at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)  
       at scala.collection.Iterator.foreach(Iterator.scala:941)         at 
scala.collection.Iterator.foreach$(Iterator.scala:941)         at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)  
       at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)    
     at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)     
    at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)       
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)   
      at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)         
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)         at 
org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)       
  at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)       
  at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)      
   at 
org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28) 
        at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)  
       at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)  
       at 
org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)  
       at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1004)         
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2133)_
{code}


This happens when I call rdd.foreach(send_http), where 
rdd = sc.parallelize(["http://192.168.1.1:5000/index.html"]) and send_http is 
defined as follows:

def send_http(url):
    req = urllib.request.Request(url)
    resp = urllib.request.urlopen(req)

Can anyone tell me the problem? Thanks.

  was:
I use urllib.request to send http request in foreach/foreachPartition. pyspark 
throw error as follow:I use urllib.request to send http request in 
foreach/foreachPartition. pyspark throw error as follow:

_objc[74094]: +[__NSPlaceholderDate initialize] may have been in progress in 
another thread when fork() was called. We cannot safely call it or ignore it in 
the fork() child process. Crashing instead. Set a breakpoint on 
objc_initializeAfterForkError to debug.20/07/20 19:05:58 ERROR Executor: 
Exception in task 7.0 in stage 0.0 (TID 7)org.apache.spark.SparkException: 
Python worker exited unexpectedly (crashed)        at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:536)
         at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:525)
         at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)   
      at 
org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:643)   
      at 
org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:621)   
      at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
         at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)  
       at scala.collection.Iterator.foreach(Iterator.scala:941)         at 
scala.collection.Iterator.foreach$(Iterator.scala:941)         at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)  
       at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)    
     at 

[jira] [Commented] (SPARK-32359) Implement max_error metric evaluator for spark regression mllib

2020-07-27 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166095#comment-17166095
 ] 

Hyukjin Kwon commented on SPARK-32359:
--

Please fill JIRA description.

> Implement max_error metric evaluator for spark regression mllib
> ---
>
> Key: SPARK-32359
> URL: https://issues.apache.org/jira/browse/SPARK-32359
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 3.0.0
>Reporter: Sayed Mohammad Hossein Torabi
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32423) class 'DataFrame' returns instance of type(self) instead of DataFrame

2020-07-27 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166114#comment-17166114
 ] 

Hyukjin Kwon commented on SPARK-32423:
--

Can you show some pseudo code? It's a bit difficult to follow what this JIRA 
targets.

> class 'DataFrame' returns instance of type(self) instead of DataFrame 
> --
>
> Key: SPARK-32423
> URL: https://issues.apache.org/jira/browse/SPARK-32423
> Project: Spark
>  Issue Type: Wish
>  Components: PySpark
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Timothy
>Priority: Minor
>
> To allow for appropriate child classing of DataFrame, I propose the following 
> change:
> class 'DataFrame' returns an instance of type(self) instead of type DataFrame. 
>  
> Therefore child classes using methods such as '.limit()' will return an 
> instance of the child class.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32463) Document Data Type inference rule in SQL reference

2020-07-27 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166045#comment-17166045
 ] 

Huaxin Gao commented on SPARK-32463:


[~planga82] You are welcome to work on this if you have free time. Thanks!

> Document Data Type inference rule in SQL reference
> --
>
> Key: SPARK-32463
> URL: https://issues.apache.org/jira/browse/SPARK-32463
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Document the data type inference rules in the SQL reference, under the Data 
> Types section. Please see this PR: https://github.com/apache/spark/pull/28896
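
Which inference rules the linked PR documents is not spelled out here; as a 
generic illustration of the topic, Spark SQL's typeof function (available 
since 3.0) shows what type Spark infers for a literal.

    # Show the data types Spark SQL infers for a few literals.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.sql("SELECT typeof(1), typeof(1.5), typeof('2020-07-27')").show()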



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32464) Support skew handling on join that has one side with no query stage

2020-07-27 Thread Wang, Gang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wang, Gang updated SPARK-32464:
---
Summary: Support skew handling on join that has one side with no query 
stage  (was: Support skew handling on join with one side that has no query 
stage)

> Support skew handling on join that has one side with no query stage
> ---
>
> Key: SPARK-32464
> URL: https://issues.apache.org/jira/browse/SPARK-32464
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wang, Gang
>Priority: Major
>
> In our production environment, there are many bucket tables that are joined 
> with other tables, and skewed joins occur now and then. However, in the 
> current implementation, skew join handling can only be applied when both 
> sides of a sort-merge join (SMJ) are query stages, so it cannot handle such 
> cases.
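
For context, a sketch of the existing AQE skew-join knobs in Spark 3.0; the 
keys are the actual configuration names, while the values are only 
illustrative. As described above, this mechanism currently rewrites a 
sort-merge join only when both sides are shuffle query stages, so a bucketed 
(non-shuffle) side is not covered.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
    # A partition counts as skewed when it is larger than this factor times
    # the median partition size and also exceeds the byte threshold below.
    spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
    spark.conf.set(
        "spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")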



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32464) Support skew handling on join that has one side with no query stage

2020-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32464:


Assignee: (was: Apache Spark)

> Support skew handling on join that has one side with no query stage
> ---
>
> Key: SPARK-32464
> URL: https://issues.apache.org/jira/browse/SPARK-32464
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wang, Gang
>Priority: Major
>
> In our production environment, there are many bucket tables that are joined 
> with other tables, and skewed joins occur now and then. However, in the 
> current implementation, skew join handling can only be applied when both 
> sides of a sort-merge join (SMJ) are query stages, so it cannot handle such 
> cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32464) Support skew handling on join that has one side with no query stage

2020-07-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32464:


Assignee: Apache Spark

> Support skew handling on join that has one side with no query stage
> ---
>
> Key: SPARK-32464
> URL: https://issues.apache.org/jira/browse/SPARK-32464
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wang, Gang
>Assignee: Apache Spark
>Priority: Major
>
> In our production environment, there are many bucket tables that are joined 
> with other tables, and skewed joins occur now and then. However, in the 
> current implementation, skew join handling can only be applied when both 
> sides of a sort-merge join (SMJ) are query stages, so it cannot handle such 
> cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org





