[jira] [Commented] (SPARK-24528) Missing optimization for Aggregations/Windowing on a bucketed table

2020-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189004#comment-17189004
 ] 

Apache Spark commented on SPARK-24528:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/29625

> Missing optimization for Aggregations/Windowing on a bucketed table
> ---
>
> Key: SPARK-24528
> URL: https://issues.apache.org/jira/browse/SPARK-24528
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ohad Raviv
>Priority: Major
>
> Closely related to SPARK-24410, we're trying to optimize a very common use 
> case we have of getting the most updated row by id from a fact table.
> We're saving the table bucketed to skip the shuffle stage, but we still 
> "waste" time on the Sort operator even though the data is already sorted.
> Here's a good example:
> {code:java}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("key", "t1")
> .saveAsTable("a1"){code}
> {code:java}
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#24L], functions=[max(named_struct(t1, t1#25L, key, 
> key#24L, t1, t1#25L, t2, t2#26L))])
> +- SortAggregate(key=[key#24L], functions=[partial_max(named_struct(t1, 
> t1#25L, key, key#24L, t1, t1#25L, t2, t2#26L))])
> +- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: true, 
> Format: Parquet, Location: ...{code}
>  
> And here's a bad, but more realistic, example:
> {code:java}
> sparkSession.sql("set spark.sql.shuffle.partitions=2")
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#32L], functions=[max(named_struct(t1, t1#33L, key, 
> key#32L, t1, t1#33L, t2, t2#34L))])
> +- SortAggregate(key=[key#32L], functions=[partial_max(named_struct(t1, 
> t1#33L, key, key#32L, t1, t1#33L, t2, t2#34L))])
> +- *(1) Sort [key#32L ASC NULLS FIRST], false, 0
> +- *(1) FileScan parquet default.a1[key#32L,t1#33L,t2#34L] Batched: true, 
> Format: Parquet, Location: ...
> {code}
>  
> I've traced the problem to DataSourceScanExec#235:
> {code:java}
> val sortOrder = if (sortColumns.nonEmpty) {
>   // In case of bucketing, it's possible to have multiple files belonging to the
>   // same bucket in a given relation. Each of these files is locally sorted,
>   // but those files combined together are not globally sorted. Given that,
>   // the RDD partition will not be sorted even if the relation has sort columns set.
>   // Current solution is to check if all the buckets have a single file in it.
>   val files = selectedPartitions.flatMap(partition => partition.files)
>   val bucketToFilesGrouping =
>     files.map(_.getPath.getName).groupBy(file => BucketingUtils.getBucketId(file))
>   val singleFilePartitions = bucketToFilesGrouping.forall(p => p._2.length <= 1){code}
> So obviously the code avoids dealing with this situation for now.
> Could you think of a way to solve this or bypass it?
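
As a side note for readers of this thread: the condition the quoted DataSourceScanExec snippet enforces is simply "at most one data file per bucket". Below is a small, self-contained sketch of that check over made-up bucket file names; the file names and the bucket-id parsing are illustrative assumptions, not Spark internals.

{code:scala}
// Illustrative only: file names and the bucket-id regex are assumptions for this sketch.
val fileNames = Seq(
  "part-00000-aaaa_00000.c000.snappy.parquet",
  "part-00001-bbbb_00001.c000.snappy.parquet",
  "part-00002-cccc_00001.c000.snappy.parquet") // bucket 1 has two files

def bucketIdOf(name: String): Option[Int] =
  "_(\\d+)\\.".r.findFirstMatchIn(name).map(_.group(1).toInt)

val byBucket = fileNames.groupBy(bucketIdOf)
// false here, so a reader applying the same rule would drop the reported sort order
val singleFilePerBucket = byBucket.values.forall(_.length <= 1)
{code}

When that condition fails, each file of a bucket is locally sorted but their concatenation is not, which is why the plan keeps the extra Sort.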



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24528) Missing optimization for Aggregations/Windowing on a bucketed table

2020-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24528:


Assignee: Apache Spark

> Missing optimization for Aggregations/Windowing on a bucketed table
> ---
>
> Key: SPARK-24528
> URL: https://issues.apache.org/jira/browse/SPARK-24528
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ohad Raviv
>Assignee: Apache Spark
>Priority: Major
>
> Closely related to SPARK-24410, we're trying to optimize a very common use 
> case we have of getting the most updated row by id from a fact table.
> We're saving the table bucketed to skip the shuffle stage, but we still 
> "waste" time on the Sort operator even though the data is already sorted.
> Here's a good example:
> {code:java}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("key", "t1")
> .saveAsTable("a1"){code}
> {code:java}
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#24L], functions=[max(named_struct(t1, t1#25L, key, 
> key#24L, t1, t1#25L, t2, t2#26L))])
> +- SortAggregate(key=[key#24L], functions=[partial_max(named_struct(t1, 
> t1#25L, key, key#24L, t1, t1#25L, t2, t2#26L))])
> +- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: true, 
> Format: Parquet, Location: ...{code}
>  
> And here's a bad, but more realistic, example:
> {code:java}
> sparkSession.sql("set spark.sql.shuffle.partitions=2")
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#32L], functions=[max(named_struct(t1, t1#33L, key, 
> key#32L, t1, t1#33L, t2, t2#34L))])
> +- SortAggregate(key=[key#32L], functions=[partial_max(named_struct(t1, 
> t1#33L, key, key#32L, t1, t1#33L, t2, t2#34L))])
> +- *(1) Sort [key#32L ASC NULLS FIRST], false, 0
> +- *(1) FileScan parquet default.a1[key#32L,t1#33L,t2#34L] Batched: true, 
> Format: Parquet, Location: ...
> {code}
>  
> I've traced the problem to DataSourceScanExec#235:
> {code:java}
> val sortOrder = if (sortColumns.nonEmpty) {
>   // In case of bucketing, it's possible to have multiple files belonging to the
>   // same bucket in a given relation. Each of these files is locally sorted,
>   // but those files combined together are not globally sorted. Given that,
>   // the RDD partition will not be sorted even if the relation has sort columns set.
>   // Current solution is to check if all the buckets have a single file in it.
>   val files = selectedPartitions.flatMap(partition => partition.files)
>   val bucketToFilesGrouping =
>     files.map(_.getPath.getName).groupBy(file => BucketingUtils.getBucketId(file))
>   val singleFilePartitions = bucketToFilesGrouping.forall(p => p._2.length <= 1){code}
> So obviously the code avoids dealing with this situation for now.
> Could you think of a way to solve this or bypass it?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24528) Missing optimization for Aggregations/Windowing on a bucketed table

2020-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24528:


Assignee: (was: Apache Spark)

> Missing optimization for Aggregations/Windowing on a bucketed table
> ---
>
> Key: SPARK-24528
> URL: https://issues.apache.org/jira/browse/SPARK-24528
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ohad Raviv
>Priority: Major
>
> Closely related to SPARK-24410, we're trying to optimize a very common use 
> case we have of getting the most updated row by id from a fact table.
> We're saving the table bucketed to skip the shuffle stage, but we still 
> "waste" time on the Sort operator even though the data is already sorted.
> Here's a good example:
> {code:java}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("key", "t1")
> .saveAsTable("a1"){code}
> {code:java}
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#24L], functions=[max(named_struct(t1, t1#25L, key, 
> key#24L, t1, t1#25L, t2, t2#26L))])
> +- SortAggregate(key=[key#24L], functions=[partial_max(named_struct(t1, 
> t1#25L, key, key#24L, t1, t1#25L, t2, t2#26L))])
> +- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: true, 
> Format: Parquet, Location: ...{code}
>  
> And here's a bad, but more realistic, example:
> {code:java}
> sparkSession.sql("set spark.sql.shuffle.partitions=2")
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#32L], functions=[max(named_struct(t1, t1#33L, key, 
> key#32L, t1, t1#33L, t2, t2#34L))])
> +- SortAggregate(key=[key#32L], functions=[partial_max(named_struct(t1, 
> t1#33L, key, key#32L, t1, t1#33L, t2, t2#34L))])
> +- *(1) Sort [key#32L ASC NULLS FIRST], false, 0
> +- *(1) FileScan parquet default.a1[key#32L,t1#33L,t2#34L] Batched: true, 
> Format: Parquet, Location: ...
> {code}
>  
> I've traced the problem to DataSourceScanExec#235:
> {code:java}
> val sortOrder = if (sortColumns.nonEmpty) {
>   // In case of bucketing, it's possible to have multiple files belonging to the
>   // same bucket in a given relation. Each of these files is locally sorted,
>   // but those files combined together are not globally sorted. Given that,
>   // the RDD partition will not be sorted even if the relation has sort columns set.
>   // Current solution is to check if all the buckets have a single file in it.
>   val files = selectedPartitions.flatMap(partition => partition.files)
>   val bucketToFilesGrouping =
>     files.map(_.getPath.getName).groupBy(file => BucketingUtils.getBucketId(file))
>   val singleFilePartitions = bucketToFilesGrouping.forall(p => p._2.length <= 1){code}
> So obviously the code avoids dealing with this situation for now.
> Could you think of a way to solve this or bypass it?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32767) Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

2020-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188970#comment-17188970
 ] 

Apache Spark commented on SPARK-32767:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/29624

> Bucket join should work if spark.sql.shuffle.partitions larger than bucket 
> number
> -
>
> Key: SPARK-32767
> URL: https://issues.apache.org/jira/browse/SPARK-32767
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:scala}
> spark.range(1000).write.bucketBy(432, "id").saveAsTable("t1")
> spark.range(1000).write.bucketBy(34, "id").saveAsTable("t2")
> sql("set spark.sql.shuffle.partitions=600")
> sql("set spark.sql.autoBroadcastJoinThreshold=-1")
> sql("select * from t1 join t2 on t1.id = t2.id").explain()
> {code}
> {noformat}
> == Physical Plan ==
> *(5) SortMergeJoin [id#26L], [id#27L], Inner
> :- *(2) Sort [id#26L ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(id#26L, 600), true, [id=#65]
> : +- *(1) Filter isnotnull(id#26L)
> :+- *(1) ColumnarToRow
> :   +- FileScan parquet default.t1[id#26L] Batched: true, 
> DataFilters: [isnotnull(id#26L)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/spark/SPARK-32444/sql/core/spark-warehouse/org.apache.spark...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: 
> struct, SelectedBucketsCount: 432 out of 432
> +- *(4) Sort [id#27L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(id#27L, 600), true, [id=#74]
>   +- *(3) Filter isnotnull(id#27L)
>  +- *(3) ColumnarToRow
> +- FileScan parquet default.t2[id#27L] Batched: true, 
> DataFilters: [isnotnull(id#27L)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/spark/SPARK-32444/sql/core/spark-warehouse/org.apache.spark...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: 
> struct, SelectedBucketsCount: 34 out of 34
> {noformat}
> *Expected*:
> {noformat}
> == Physical Plan ==
> *(4) SortMergeJoin [id#26L], [id#27L], Inner
> :- *(1) Sort [id#26L ASC NULLS FIRST], false, 0
> :  +- *(1) Filter isnotnull(id#26L)
> : +- *(1) ColumnarToRow
> :+- FileScan parquet default.t1[id#26L] Batched: true, DataFilters: 
> [isnotnull(id#26L)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/spark/SPARK-32444/sql/core/spark-warehouse/org.apache.spark...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: 
> struct, SelectedBucketsCount: 432 out of 432
> +- *(3) Sort [id#27L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(id#27L, 432), true, [id=#69]
>   +- *(2) Filter isnotnull(id#27L)
>  +- *(2) ColumnarToRow
> +- FileScan parquet default.t2[id#27L] Batched: true, 
> DataFilters: [isnotnull(id#27L)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/spark/SPARK-32444/sql/core/spark-warehouse/org.apache.spark...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: 
> struct, SelectedBucketsCount: 34 out of 34
> {noformat}
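
Until this is fixed, one possible workaround, assuming the behavior reported above, is to keep spark.sql.shuffle.partitions at or below the larger bucket count so that the bucketed side's partitioning can be reused. A sketch, not a verified fix:

{code:scala}
// Sketch of a possible workaround based on the report above (assumption, not a verified fix):
// with shuffle partitions at the larger bucket count, only the smaller-bucketed side
// should need an Exchange, matching the "Expected" plan.
sql("set spark.sql.shuffle.partitions=432")
sql("set spark.sql.autoBroadcastJoinThreshold=-1")
sql("select * from t1 join t2 on t1.id = t2.id").explain()
{code}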



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32691) Test org.apache.spark.DistributedSuite failed on arm64 jenkins

2020-09-01 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188958#comment-17188958
 ] 

Dongjoon Hyun commented on SPARK-32691:
---

You don't need to be sorry for not fixing the issue. Reporting a bug is an 
invaluable contribution. It's just difficult for most community members to 
reproduce/develop/test on ARM.

> Test org.apache.spark.DistributedSuite failed on arm64 jenkins
> --
>
> Key: SPARK-32691
> URL: https://issues.apache.org/jira/browse/SPARK-32691
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0
> Environment: ARM64
>Reporter: huangtianhua
>Priority: Major
> Attachments: failure.log, success.log
>
>
> Tests of org.apache.spark.DistributedSuite fail on the arm64 Jenkins: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ 
> - caching in memory and disk, replicated (encryption = on) (with 
> replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory and disk, serialized, replicated (encryption = on) 
> (with replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory, serialized, replicated (encryption = on) (with 
> replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> ...
> 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32776) Limit in streaming should not be optimized away by PropagateEmptyRelation

2020-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188913#comment-17188913
 ] 

Apache Spark commented on SPARK-32776:
--

User 'liwensun' has created a pull request for this issue:
https://github.com/apache/spark/pull/29623

> Limit in streaming should not be optimized away by PropagateEmptyRelation
> -
>
> Key: SPARK-32776
> URL: https://issues.apache.org/jira/browse/SPARK-32776
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Liwen Sun
>Priority: Major
>
> Right now, the limit operator in a streaming query may get optimized away 
> when the relation is empty. This can be problematic for stateful streaming, 
> as this empty batch will not write any state store files, and the next batch 
> will fail when trying to read these state store files and throw a file not 
> found error.
> We should not let PropagateEmptyRelation optimize away the Limit operator for 
> streaming queries.
> This ticket is intended to apply a small and safe fix for 
> PropagateEmptyRelation. A fundamental fix that can prevent this from 
> happening again in the future and in other optimizer rules is more desirable, 
> but that's a much larger task.
>  
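
For context, the kind of query affected would be a streaming query with a global limit, which keeps state across micro-batches. A minimal hedged sketch of such a query (an illustration of the scenario described above, not a confirmed reproduction of the bug):

{code:scala}
// Hedged sketch only: illustrates a stateful streaming limit, not a confirmed repro of this bug.
val stream = spark.readStream.format("rate").option("rowsPerSecond", "1").load()

val query = stream
  .limit(5)                      // a global limit in streaming keeps state across batches
  .writeStream
  .format("memory")
  .queryName("limited_rate")
  .outputMode("append")
  .start()
{code}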



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32776) Limit in streaming should not be optimized away by PropagateEmptyRelation

2020-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32776:


Assignee: Apache Spark

> Limit in streaming should not be optimized away by PropagateEmptyRelation
> -
>
> Key: SPARK-32776
> URL: https://issues.apache.org/jira/browse/SPARK-32776
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Liwen Sun
>Assignee: Apache Spark
>Priority: Major
>
> Right now, the limit operator in a streaming query may get optimized away 
> when the relation is empty. This can be problematic for stateful streaming, 
> as this empty batch will not write any state store files, and the next batch 
> will fail when trying to read these state store files and throw a file not 
> found error.
> We should not let PropagateEmptyRelation optimize away the Limit operator for 
> streaming queries.
> This ticket is intended to apply a small and safe fix for 
> PropagateEmptyRelation. A fundamental fix that can prevent this from 
> happening again in the future and in other optimizer rules is more desirable, 
> but that's a much larger task.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32776) Limit in streaming should not be optimized away by PropagateEmptyRelation

2020-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32776:


Assignee: (was: Apache Spark)

> Limit in streaming should not be optimized away by PropagateEmptyRelation
> -
>
> Key: SPARK-32776
> URL: https://issues.apache.org/jira/browse/SPARK-32776
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Liwen Sun
>Priority: Major
>
> Right now, the limit operator in a streaming query may get optimized away 
> when the relation is empty. This can be problematic for stateful streaming, 
> as this empty batch will not write any state store files, and the next batch 
> will fail when trying to read these state store files and throw a file not 
> found error.
> We should not let PropagateEmptyRelation optimize away the Limit operator for 
> streaming queries.
> This ticket is intended to apply a small and safe fix for 
> PropagateEmptyRelation. A fundamental fix that can prevent this from 
> happening again in the future and in other optimizer rules is more desirable, 
> but that's a much larger task.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32746) Not able to run Pandas UDF

2020-09-01 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188907#comment-17188907
 ] 

Hyukjin Kwon commented on SPARK-32746:
--

It's still hard to tell from the logs. Can you run other PySpark code, or just 
a regular Python UDF (instead of a pandas UDF)?

> Not able to run Pandas UDF 
> ---
>
> Key: SPARK-32746
> URL: https://issues.apache.org/jira/browse/SPARK-32746
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
> Environment: Pyspark 3.0.0
> PyArrow - 1.0.1(also tried with Pyarrrow 0.15.1, no progress there)
> Pandas - 0.25.3
>  
>Reporter: Rahul Bhatia
>Priority: Major
> Attachments: Screenshot 2020-08-31 at 9.04.07 AM.png
>
>
> Hi,
> I am facing issues in running Pandas UDF on a yarn cluster with multiple 
> nodes, I am trying to perform a simple DBSCAN algorithm to multiple groups in 
> my dataframe, to start with, I am just using a simple example to test things 
> out - 
> {code:python}
> import pandas as pd
> from pyspark.sql.types import StructType, StructField, DoubleType, 
> StringType, IntegerType
> from sklearn.cluster import DBSCAN
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> data = [(1, 11.6133, 48.1075),
>  (1, 11.6142, 48.1066),
>  (1, 11.6108, 48.1061),
>  (1, 11.6207, 48.1192),
>  (1, 11.6221, 48.1223),
>  (1, 11.5969, 48.1276),
>  (2, 11.5995, 48.1258),
>  (2, 11.6127, 48.1066),
>  (2, 11.6430, 48.1275),
>  (2, 11.6368, 48.1278),
>  (2, 11.5930, 48.1156)]
> df = spark.createDataFrame(data, ["id", "X", "Y"])
> output_schema = StructType(
> [
> StructField('id', IntegerType()),
> StructField('X', DoubleType()),
> StructField('Y', DoubleType()),
> StructField('cluster', IntegerType())
>  ]
> )
> @pandas_udf(output_schema, PandasUDFType.GROUPED_MAP)
> def dbscan(data):
> data["cluster"] = DBSCAN(eps=5, min_samples=3).fit_predict(data[["X", 
> "Y"]])
> result = pd.DataFrame(data, columns=["id", "X", "Y", "cluster"])
> return result
> res = df.groupby("id").apply(dbscan)
> res.show()
> {code}
>  
> The code keeps running forever on the YARN cluster. I expect it to finish 
> within seconds (this works fine in standalone mode and finishes in 2-4 
> seconds). Checking the Spark UI, I can see that the Spark job is stuck 
> (99/580) and never makes any progress.
>  
> It also doesn't run in parallel; am I missing something? !Screenshot 
> 2020-08-31 at 9.04.07 AM.png!
>  
>  
> I am new to Spark, and still trying to understand a lot of things. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32776) Limit in streaming should not be optimized away by PropagateEmptyRelation

2020-09-01 Thread Liwen Sun (Jira)
Liwen Sun created SPARK-32776:
-

 Summary: Limit in streaming should not be optimized away by 
PropagateEmptyRelation
 Key: SPARK-32776
 URL: https://issues.apache.org/jira/browse/SPARK-32776
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.1.0
Reporter: Liwen Sun


Right now, the limit operator in a streaming query may get optimized away when 
the relation is empty. This can be problematic for stateful streaming, as this 
empty batch will not write any state store files, and the next batch will fail 
when trying to read these state store files and throw a file not found error.

We should not let PropagateEmptyRelation optimize away the Limit operator for 
streaming queries.

This ticket is intended to apply a small and safe fix for 
PropagateEmptyRelation. A fundamental fix that can prevent this from happening 
again in the future and in other optimizer rules is more desirable, but that's 
a much larger task.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32758) Spark ignores limit(1) and starts tasks for all partition

2020-09-01 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188904#comment-17188904
 ] 

Hyukjin Kwon commented on SPARK-32758:
--

Does this happen when you read a file from HDFS on a real cluster?

> Spark ignores limit(1) and starts tasks for all partition
> -
>
> Key: SPARK-32758
> URL: https://issues.apache.org/jira/browse/SPARK-32758
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Ivan Tsukanov
>Priority: Major
> Attachments: image-2020-09-01-10-51-09-417.png
>
>
> If we run the following code
> {code:scala}
>   val sparkConf = new SparkConf()
> .setAppName("test-app")
> .setMaster("local[1]")
>   val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
>   import sparkSession.implicits._
>   val df = (1 to 10)
> .toDF("c1")
> .repartition(1000)
>   implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema)
>   df.limit(1)
> .map(identity)
> .collect()
>   df.map(identity)
> .limit(1)
> .collect()
>   Thread.sleep(10)
> {code}
> we will see that in the first case Spark started 1002 tasks despite the 
> limit(1):
> !image-2020-09-01-10-51-09-417.png!
> Expected behavior - both scenarios (limit before and after map) will produce 
> the same results - one or two tasks to get one value from the DataFrame.
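
One way to narrow this down (reusing df and the implicit encoder from the snippet above) is to compare the physical plans of the two variants and see where the limit ends up relative to the map:

{code:scala}
// Reuses df and the implicit RowEncoder defined in the example above.
df.limit(1).map(identity).explain()   // limit applied before the map
df.map(identity).limit(1).explain()   // limit applied after the map
{code}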



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32758) Spark ignores limit(1) and starts tasks for all partition

2020-09-01 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188903#comment-17188903
 ] 

Hyukjin Kwon commented on SPARK-32758:
--

I think this is because you're creating the DataFrame from a local collection. 
Does that happen when you read a file from HDFS?

> Spark ignores limit(1) and starts tasks for all partition
> -
>
> Key: SPARK-32758
> URL: https://issues.apache.org/jira/browse/SPARK-32758
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Ivan Tsukanov
>Priority: Major
> Attachments: image-2020-09-01-10-51-09-417.png
>
>
> If we run the following code
> {code:scala}
>   val sparkConf = new SparkConf()
> .setAppName("test-app")
> .setMaster("local[1]")
>   val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
>   import sparkSession.implicits._
>   val df = (1 to 10)
> .toDF("c1")
> .repartition(1000)
>   implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema)
>   df.limit(1)
> .map(identity)
> .collect()
>   df.map(identity)
> .limit(1)
> .collect()
>   Thread.sleep(10)
> {code}
> we will see that in the first case Spark started 1002 tasks despite the 
> limit(1):
> !image-2020-09-01-10-51-09-417.png!
> Expected behavior - both scenarios (limit before and after map) will produce 
> the same results - one or two tasks to get one value from the DataFrame.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-32758) Spark ignores limit(1) and starts tasks for all partition

2020-09-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32758:
-
Comment: was deleted

(was: I think this is because you're creating the DataFrame from the local 
collection. Does that happen when you read a file from a HDFS?)

> Spark ignores limit(1) and starts tasks for all partition
> -
>
> Key: SPARK-32758
> URL: https://issues.apache.org/jira/browse/SPARK-32758
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Ivan Tsukanov
>Priority: Major
> Attachments: image-2020-09-01-10-51-09-417.png
>
>
> If we run the following code
> {code:scala}
>   val sparkConf = new SparkConf()
> .setAppName("test-app")
> .setMaster("local[1]")
>   val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
>   import sparkSession.implicits._
>   val df = (1 to 10)
> .toDF("c1")
> .repartition(1000)
>   implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema)
>   df.limit(1)
> .map(identity)
> .collect()
>   df.map(identity)
> .limit(1)
> .collect()
>   Thread.sleep(10)
> {code}
> we will see that in the first case Spark started 1002 tasks despite the 
> limit(1):
> !image-2020-09-01-10-51-09-417.png!
> Expected behavior - both scenarios (limit before and after map) will produce 
> the same results - one or two tasks to get one value from the DataFrame.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32760) Support for INET data type

2020-09-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32760.
--
Resolution: Later

> Support for INET data type
> --
>
> Key: SPARK-32760
> URL: https://issues.apache.org/jira/browse/SPARK-32760
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0, 3.1.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> PostgreSQL has support for the `INET` data type 
> [https://www.postgresql.org/docs/9.1/datatype-net-types.html]
> We have a few customers that are interested in similar, native support for IP 
> addresses, just like in PostgreSQL.
> The issue with storing IP addresses as strings is that most of the matches 
> (like whether an IP address belongs to a subnet) can't leverage Parquet bloom 
> filters in most cases. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32760) Support for INET data type

2020-09-01 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188902#comment-17188902
 ] 

Hyukjin Kwon commented on SPARK-32760:
--

The problem is that you would also need to implement the serde on the Python 
and R sides to make this properly supported, which is a huge amount of work. 
Let's not do this unless there's a very strong reason and very wide need from 
the community.


> Support for INET data type
> --
>
> Key: SPARK-32760
> URL: https://issues.apache.org/jira/browse/SPARK-32760
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0, 3.1.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> PostgreSQL has support for the `INET` data type 
> [https://www.postgresql.org/docs/9.1/datatype-net-types.html]
> We have a few customers that are interested in similar, native support for IP 
> addresses, just like in PostgreSQL.
> The issue with storing IP addresses as strings is that most of the matches 
> (like whether an IP address belongs to a subnet) can't leverage Parquet bloom 
> filters in most cases. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32766) s3a: bucket names with dots cannot be used

2020-09-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32766.
--
Resolution: Invalid

This doesn't look like a problem in Spark.

> s3a: bucket names with dots cannot be used
> --
>
> Key: SPARK-32766
> URL: https://issues.apache.org/jira/browse/SPARK-32766
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.0
>Reporter: Ondrej Kokes
>Priority: Minor
>
> Running vanilla spark with
> {noformat}
> --packages=org.apache.hadoop:hadoop-aws:x.y.z{noformat}
> I cannot read from S3, if the bucket name contains a dot (a valid name).
> A minimal reproducible example looks like this
> {{from pyspark.sql import SparkSession}}
> {{import pyspark.sql.functions as f}}
> {{if __name__ == '__main__':}}
> {{  spark = (SparkSession}}
> {{    .builder}}
> {{    .appName('my_app')}}
> {{    .master("local[*]")}}
> {{    .getOrCreate()}}
> {{  )}}
> {{  spark.read.csv("s3a://test-bucket-name-v1.0/foo.csv")}}
> Or just launch a spark-shell with `--packages=(...)hadoop-aws(...)` and read 
> that CSV. I created the same bucket without the period and it worked fine.
> *Now I'm not sure whether this is a matter of prepping the path names and 
> passing them to the aws-sdk, or whether the fault is within the SDK itself. I 
> am not Java-savvy enough to investigate the issue further, but I tried to make 
> the repro as short as possible.*
> 
> I get different errors depending on which Hadoop distributions I use. If I 
> use the default PySpark distribution (which includes Hadoop 2), I get the 
> following (using hadoop-aws:2.7.4)
> {{scala> spark.read.csv("s3a://okokes-test-v2.5/foo.csv").show()}}
> {{java.lang.IllegalArgumentException: The bucketName parameter must be 
> specified.}}
> {{ at 
> com.amazonaws.services.s3.AmazonS3Client.assertParameterNotNull(AmazonS3Client.java:2816)}}
> {{ at 
> com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1026)}}
> {{ at 
> com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)}}
> {{ at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)}}
> {{ at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)}}
> {{ at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)}}
> {{ at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)}}
> {{ at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)}}
> {{ at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)}}
> {{ at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)}}
> {{ at 
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)}}
> {{ at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)}}
> {{ at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)}}
> {{ at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)}}
> {{ at scala.Option.getOrElse(Option.scala:189)}}
> {{ at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)}}
> {{ at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)}}
> {{ at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:535)}}
> {{ ... 47 elided}}
> When I downloaded 3.0.0 with Hadoop 3 and ran a spark-shell there, I got this 
> error (with hadoop-aws:3.2.0):
> {{java.lang.NullPointerException: null uri host.}}
> {{ at java.base/java.util.Objects.requireNonNull(Objects.java:246)}}
> {{ at 
> org.apache.hadoop.fs.s3native.S3xLoginHelper.buildFSURI(S3xLoginHelper.java:71)}}
> {{ at org.apache.hadoop.fs.s3a.S3AFileSystem.setUri(S3AFileSystem.java:470)}}
> {{ at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:235)}}
> {{ at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)}}
> {{ at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)}}
> {{ at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)}}
> {{ at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)}}
> {{ at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)}}
> {{ at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)}}
> {{ at 
> org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)}}
> {{ at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)}}
> {{ at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)}}
> {{ at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)}}
> {{ at scala.Option.getOrElse(Option.scala:189)}}
> {{ at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)}}
> {{ at 

[jira] [Resolved] (SPARK-32771) The example of expressions.Aggregator in Javadoc / Scaladoc is wrong

2020-09-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32771.
--
Fix Version/s: 3.1.0
   2.4.7
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 29617
[https://github.com/apache/spark/pull/29617]

> The example of expressions.Aggregator in Javadoc / Scaladoc is wrong
> 
>
> Key: SPARK-32771
> URL: https://issues.apache.org/jira/browse/SPARK-32771
> Project: Spark
>  Issue Type: Bug
>  Components: docs
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.0.1, 2.4.7, 3.1.0
>
>
> There is an example of expressions.Aggregator in the Javadoc / Scaladoc, as 
> follows.
> {code:java}
> val customSummer =  new Aggregator[Data, Int, Int] {
>   def zero: Int = 0
>   def reduce(b: Int, a: Data): Int = b + a.i
>   def merge(b1: Int, b2: Int): Int = b1 + b2
>   def finish(r: Int): Int = r
> }.toColumn(){code}
> But this example doesn't work because it doesn't define bufferEncoder and 
> outputEncoder.
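
For reference, a version of that example that actually compiles also needs the two encoders, along these lines (a sketch of the missing pieces, not necessarily the exact wording used in the docs fix):

{code:scala}
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class Data(i: Int)

val customSummer = new Aggregator[Data, Int, Int] {
  def zero: Int = 0
  def reduce(b: Int, a: Data): Int = b + a.i
  def merge(b1: Int, b2: Int): Int = b1 + b2
  def finish(r: Int): Int = r
  // the two members the documented example leaves out:
  def bufferEncoder: Encoder[Int] = Encoders.scalaInt
  def outputEncoder: Encoder[Int] = Encoders.scalaInt
}.toColumn
{code}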



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32774) Don't track docs/.jekyll-cache

2020-09-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32774.
--
Target Version/s: 3.0.1, 3.1.0
  Resolution: Fixed

> Don't track docs/.jekyll-cache
> --
>
> Key: SPARK-32774
> URL: https://issues.apache.org/jira/browse/SPARK-32774
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> I noticed that docs/.jekyll-cache can sometimes be created, and it should not 
> be tracked.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32774) Don't track docs/.jekyll-cache

2020-09-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32774:
-
Target Version/s:   (was: 3.0.1, 3.1.0)

> Don't track docs/.jekyll-cache
> --
>
> Key: SPARK-32774
> URL: https://issues.apache.org/jira/browse/SPARK-32774
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> I noticed that docs/.jekyll-cache can sometimes be created, and it should not 
> be tracked.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32774) Don't track docs/.jekyll-cache

2020-09-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32774:
-
Fix Version/s: 3.1.0
   3.0.1

> Don't track docs/.jekyll-cache
> --
>
> Key: SPARK-32774
> URL: https://issues.apache.org/jira/browse/SPARK-32774
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.0.1, 3.1.0
>
>
> I noticed that docs/.jekyll-cache can sometimes be created, and it should not 
> be tracked.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32774) Don't track docs/.jekyll-cache

2020-09-01 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188890#comment-17188890
 ] 

Hyukjin Kwon commented on SPARK-32774:
--

Fixed in https://github.com/apache/spark/pull/29622

> Don't track docs/.jekyll-cache
> --
>
> Key: SPARK-32774
> URL: https://issues.apache.org/jira/browse/SPARK-32774
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> I noticed that docs/.jekyll-cache can sometimes be created, and it should not 
> be tracked.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32767) Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

2020-09-01 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-32767:

Affects Version/s: (was: 3.1.0)
   3.0.0

> Bucket join should work if spark.sql.shuffle.partitions larger than bucket 
> number
> -
>
> Key: SPARK-32767
> URL: https://issues.apache.org/jira/browse/SPARK-32767
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:scala}
> spark.range(1000).write.bucketBy(432, "id").saveAsTable("t1")
> spark.range(1000).write.bucketBy(34, "id").saveAsTable("t2")
> sql("set spark.sql.shuffle.partitions=600")
> sql("set spark.sql.autoBroadcastJoinThreshold=-1")
> sql("select * from t1 join t2 on t1.id = t2.id").explain()
> {code}
> {noformat}
> == Physical Plan ==
> *(5) SortMergeJoin [id#26L], [id#27L], Inner
> :- *(2) Sort [id#26L ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(id#26L, 600), true, [id=#65]
> : +- *(1) Filter isnotnull(id#26L)
> :+- *(1) ColumnarToRow
> :   +- FileScan parquet default.t1[id#26L] Batched: true, 
> DataFilters: [isnotnull(id#26L)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/spark/SPARK-32444/sql/core/spark-warehouse/org.apache.spark...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: 
> struct, SelectedBucketsCount: 432 out of 432
> +- *(4) Sort [id#27L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(id#27L, 600), true, [id=#74]
>   +- *(3) Filter isnotnull(id#27L)
>  +- *(3) ColumnarToRow
> +- FileScan parquet default.t2[id#27L] Batched: true, 
> DataFilters: [isnotnull(id#27L)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/spark/SPARK-32444/sql/core/spark-warehouse/org.apache.spark...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: 
> struct, SelectedBucketsCount: 34 out of 34
> {noformat}
> *Expected*:
> {noformat}
> == Physical Plan ==
> *(4) SortMergeJoin [id#26L], [id#27L], Inner
> :- *(1) Sort [id#26L ASC NULLS FIRST], false, 0
> :  +- *(1) Filter isnotnull(id#26L)
> : +- *(1) ColumnarToRow
> :+- FileScan parquet default.t1[id#26L] Batched: true, DataFilters: 
> [isnotnull(id#26L)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/spark/SPARK-32444/sql/core/spark-warehouse/org.apache.spark...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: 
> struct, SelectedBucketsCount: 432 out of 432
> +- *(3) Sort [id#27L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(id#27L, 432), true, [id=#69]
>   +- *(2) Filter isnotnull(id#27L)
>  +- *(2) ColumnarToRow
> +- FileScan parquet default.t2[id#27L] Batched: true, 
> DataFilters: [isnotnull(id#27L)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/spark/SPARK-32444/sql/core/spark-warehouse/org.apache.spark...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: 
> struct, SelectedBucketsCount: 34 out of 34
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32530) SPIP: Kotlin support for Apache Spark

2020-09-01 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188879#comment-17188879
 ] 

Jungtaek Lim commented on SPARK-32530:
--

I'm not sure about the relation between vendor support and this issue. Could 
you elaborate on that point?

Maintenance cost is one of the critical factors for the project's health. 
Vendors spend their own effort to support their packages, whereas an ASF 
project runs on volunteers' effort, which is limited.

> SPIP: Kotlin support for Apache Spark
> -
>
> Key: SPARK-32530
> URL: https://issues.apache.org/jira/browse/SPARK-32530
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Pasha Finkeshteyn
>Priority: Major
>
> h2. Background and motivation
> Kotlin is a cross-platform, statically typed, general-purpose JVM language. 
> In the last year more than 5 million developers have used Kotlin in mobile, 
> backend, frontend and scientific development. The number of Kotlin developers 
> grows rapidly every year. 
>  * [According to 
> redmonk|https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/]: 
> "Kotlin, the second fastest growing language we’ve seen outside of Swift, 
> made a big splash a year ago at this time when it vaulted eight full spots up 
> the list."
>  * [According to snyk.io|https://snyk.io/wp-content/uploads/jvm_2020.pdf], 
> Kotlin is the second most popular language on the JVM
>  * [According to 
> StackOverflow|https://insights.stackoverflow.com/survey/2020] Kotlin’s share 
> increased by 7.8% in 2020.
> We notice the increasing usage of Kotlin in data analysis ([6% of users in 
> 2020|https://www.jetbrains.com/lp/devecosystem-2020/kotlin/], as opposed to 
> 2% in 2019) and machine learning (3% of users in 2020, as opposed to 0% in 
> 2019), and we expect these numbers to continue to grow. 
> We, authors of this SPIP, strongly believe that making Kotlin API officially 
> available to developers can bring new users to Apache Spark and help some of 
> the existing users.
> h2. Goals
> The goal of this project is to bring first-class support for Kotlin language 
> into the Apache Spark project. We’re going to achieve this by adding one more 
> module to the current Apache Spark distribution.
> h2. Non-goals
> There is no goal to replace any existing language support or to change any 
> existing Apache Spark API.
> At this time, there is no goal to support non-core APIs of Apache Spark like 
> Spark ML and Spark structured streaming. This may change in the future based 
> on community feedback.
> There is no goal to provide CLI for Kotlin for Apache Spark, this will be a 
> separate SPIP.
> There is no goal to provide support for Apache Spark < 3.0.0.
> h2. Current implementation
> A working prototype is available at 
> [https://github.com/JetBrains/kotlin-spark-api]. It has been tested inside 
> JetBrains and by early adopters.
> h2. What are the risks?
> There is always a risk that this product won’t get enough popularity and will 
> bring more costs than benefits. It can be mitigated by the fact that we don't 
> need to change any existing API and support can be potentially dropped at any 
> time.
> We also believe that existing API is rather low maintenance. It does not 
> bring anything more complex than already exists in the Spark codebase. 
> Furthermore, the implementation is compact - less than 2000 lines of code.
> We are committed to maintaining, improving and evolving the API based on 
> feedback from both Spark and Kotlin communities. As the Kotlin data community 
> continues to grow, we see Kotlin API for Apache Spark as an important part in 
> the evolving Kotlin ecosystem, and intend to fully support it. 
> h2. How long will it take?
> A working implementation is already available, and if the community has any 
> proposals for improving this implementation, these can be implemented quickly, 
> in weeks if not days.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32775) [k8s] Spark client dependency support ignores non-local paths

2020-09-01 Thread Xuzhou Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuzhou Yin updated SPARK-32775:
---
Description: 
According to the logic of this line: 
[https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L161,]
 Spark filters out all paths which are not local (ie. no scheme or 
[file://|file:///] scheme). It may cause non-local dependencies not loaded by 
Driver.

For example, when starting a Spark job with 
spark.jars=*local*:///local/path/1.jar,*s3*://s3/path/2.jar,*file*:///local/path/3.jar,
 it seems like this logic will upload *file*:///local/path/3.jar to s3, and 
reset spark.jars to only s3://transformed/path/3.jar, while completely ignoring 
local:///local/path/1.jar and s3:///s3/path/2.jar.

We need to fix this logic such that Spark uploads local files to S3, and 
transforms the paths while keeping all other paths as they are.

  was:
According to the logic of this line: 
[https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L161,]
 Spark filters out all paths which are not local (ie. no scheme or 
[file://|file:///] scheme). It may cause non-local dependencies not loaded by 
Driver.

For example, when starting a Spark job with 
spark.jars=*local*:///local/path/1.jar,*s3*://s3/path/2.jar,*file*:///local/path/3.jar,
 it seems like this logic will upload *file*:///local/path/3.jar to s3, and 
reset spark.jars to only s3://transformed/path/3.jar, while completely ignoring 
local:///local/path/1.jar and s3:///s3/path/2.jar.

We need to fix this logic such that Spark upload local files to S3, and 
transform the paths while keeping all other paths as they are.


> [k8s] Spark client dependency support ignores non-local paths
> -
>
> Key: SPARK-32775
> URL: https://issues.apache.org/jira/browse/SPARK-32775
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Xuzhou Yin
>Priority: Major
>
> According to the logic of this line: 
> [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L161,]
>  Spark filters out all paths which are not local (ie. no scheme or 
> [file://|file:///] scheme). It may cause non-local dependencies not loaded by 
> Driver.
> For example, when starting a Spark job with 
> spark.jars=*local*:///local/path/1.jar,*s3*://s3/path/2.jar,*file*:///local/path/3.jar,
>  it seems like this logic will upload *file*:///local/path/3.jar to s3, and 
> reset spark.jars to only s3://transformed/path/3.jar, while completely 
> ignoring local:///local/path/1.jar and s3:///s3/path/2.jar.
> We need to fix this logic such that Spark uploads local files to S3, and 
> transforms the paths while keeping all other paths as they are.
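
A hedged sketch of the intended behavior described above (illustrative Scala, not the actual Spark code): only schemeless or file:// paths get uploaded and rewritten, while everything else passes through unchanged.

{code:scala}
import java.net.URI

// Illustrative sketch of the desired behavior, not the actual Spark implementation.
// `upload` stands in for whatever uploads a local file and returns its remote URI.
def resolveJars(paths: Seq[String], upload: String => String): Seq[String] =
  paths.map { p =>
    val scheme = new URI(p).getScheme
    if (scheme == null || scheme == "file") upload(p)  // local file: upload and rewrite
    else p                                             // local://, s3://, hdfs://, ...: keep as-is
  }
{code}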



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32775) [k8s] Spark client dependency support ignores non-local paths

2020-09-01 Thread Xuzhou Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuzhou Yin updated SPARK-32775:
---
Description: 
According to the logic of this line: 
[https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L161,]
 Spark filters out all paths which are not local (ie. no scheme or 
[file://|file:///] scheme). It may cause non-local dependencies not loaded by 
Driver.

For example, when starting a Spark job with 
spark.jars=*local*:///local/path/1.jar,*s3*://s3/path/2.jar,*file*:///local/path/3.jar,
 it seems like this logic will upload *file*:///local/path/3.jar to s3, and 
reset spark.jars to only s3://transformed/path/3.jar, while completely ignoring 
local:///local/path/1.jar and s3:///s3/path/2.jar.

We need to fix this logic such that Spark upload local files to S3, and 
transform the paths while keeping all other paths as they are.

  was:
According to the logic of this line: 
[https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L161,]
 Spark filters out all paths which are not local (ie. no scheme or 
[file://|file:///] scheme). It may cause non-local dependencies not loaded by 
Driver.

For example, when starting a Spark job with 
spark.jars=local:///local/path/1.jar,s3://s3/path/2.jar,[file:///local/path/3.jar],
 it seems like this logic will upload [file:///local/path/3.jar] to s3, and 
reset spark.jars to only s3://upload/path/3.jar, while completely ignoring 
local:///local/path/1.jar and s3:///s3/path/2.jar.

We need to fix this logic such that Spark upload local files to S3, and 
transform the paths while keeping all other paths as they are.


> [k8s] Spark client dependency support ignores non-local paths
> -
>
> Key: SPARK-32775
> URL: https://issues.apache.org/jira/browse/SPARK-32775
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Xuzhou Yin
>Priority: Major
>
> According to the logic of this line: 
> [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L161,]
>  Spark filters out all paths which are not local (ie. no scheme or 
> [file://|file:///] scheme). It may cause non-local dependencies not loaded by 
> Driver.
> For example, when starting a Spark job with 
> spark.jars=*local*:///local/path/1.jar,*s3*://s3/path/2.jar,*file*:///local/path/3.jar,
>  it seems like this logic will upload *file*:///local/path/3.jar to s3, and 
> reset spark.jars to only s3://transformed/path/3.jar, while completely 
> ignoring local:///local/path/1.jar and s3:///s3/path/2.jar.
> We need to fix this logic such that Spark upload local files to S3, and 
> transform the paths while keeping all other paths as they are.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32775) [k8s] Spark client dependency support ignores non-local paths

2020-09-01 Thread Xuzhou Yin (Jira)
Xuzhou Yin created SPARK-32775:
--

 Summary: [k8s] Spark client dependency support ignores non-local 
paths
 Key: SPARK-32775
 URL: https://issues.apache.org/jira/browse/SPARK-32775
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.0.0
Reporter: Xuzhou Yin


According to the logic of this line: 
[https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L161],
Spark filters out all paths which are not local (i.e. have no scheme or the 
[file://|file:///] scheme). This may cause non-local dependencies to not be loaded by 
the driver.

For example, when starting a Spark job with 
spark.jars=local:///local/path/1.jar,s3://s3/path/2.jar,[file:///local/path/3.jar],
 it seems like this logic will upload [file:///local/path/3.jar] to s3, and 
reset spark.jars to only s3://upload/path/3.jar, while completely ignoring 
local:///local/path/1.jar and s3:///s3/path/2.jar.

We need to fix this logic so that Spark uploads the local files to S3 and 
transforms their paths, while keeping all other paths as they are.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32767) Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

2020-09-01 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-32767:

Description: 
How to reproduce this issue:
{code:scala}
spark.range(1000).write.bucketBy(432, "id").saveAsTable("t1")
spark.range(1000).write.bucketBy(34, "id").saveAsTable("t2")
sql("set spark.sql.shuffle.partitions=600")
sql("set spark.sql.autoBroadcastJoinThreshold=-1")
sql("select * from t1 join t2 on t1.id = t2.id").explain()
{code}

{noformat}
== Physical Plan ==
*(5) SortMergeJoin [id#26L], [id#27L], Inner
:- *(2) Sort [id#26L ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(id#26L, 600), true, [id=#65]
: +- *(1) Filter isnotnull(id#26L)
:+- *(1) ColumnarToRow
:   +- FileScan parquet default.t1[id#26L] Batched: true, DataFilters: 
[isnotnull(id#26L)], Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/yumwang/spark/SPARK-32444/sql/core/spark-warehouse/org.apache.spark...,
 PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: 
struct, SelectedBucketsCount: 432 out of 432
+- *(4) Sort [id#27L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#27L, 600), true, [id=#74]
  +- *(3) Filter isnotnull(id#27L)
 +- *(3) ColumnarToRow
+- FileScan parquet default.t2[id#27L] Batched: true, DataFilters: 
[isnotnull(id#27L)], Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/yumwang/spark/SPARK-32444/sql/core/spark-warehouse/org.apache.spark...,
 PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: 
struct, SelectedBucketsCount: 34 out of 34

{noformat}

*Expected*:
{noformat}
== Physical Plan ==
*(4) SortMergeJoin [id#26L], [id#27L], Inner
:- *(1) Sort [id#26L ASC NULLS FIRST], false, 0
:  +- *(1) Filter isnotnull(id#26L)
: +- *(1) ColumnarToRow
:+- FileScan parquet default.t1[id#26L] Batched: true, DataFilters: 
[isnotnull(id#26L)], Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/yumwang/spark/SPARK-32444/sql/core/spark-warehouse/org.apache.spark...,
 PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: 
struct, SelectedBucketsCount: 432 out of 432
+- *(3) Sort [id#27L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#27L, 432), true, [id=#69]
  +- *(2) Filter isnotnull(id#27L)
 +- *(2) ColumnarToRow
+- FileScan parquet default.t2[id#27L] Batched: true, DataFilters: 
[isnotnull(id#27L)], Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/yumwang/spark/SPARK-32444/sql/core/spark-warehouse/org.apache.spark...,
 PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: 
struct, SelectedBucketsCount: 34 out of 34

{noformat}



  was:
How to reproduce this issue:
{code:scala}
spark.range(1000).write.bucketBy(500, "id").saveAsTable("t1")
spark.range(1000).write.bucketBy(50, "id").saveAsTable("t2")
sql("set spark.sql.shuffle.partitions=600")
sql("set spark.sql.autoBroadcastJoinThreshold=-1")
sql("select * from t1 join t2 on t1.id = t2.id").explain()
{code}


> Bucket join should work if spark.sql.shuffle.partitions larger than bucket 
> number
> -
>
> Key: SPARK-32767
> URL: https://issues.apache.org/jira/browse/SPARK-32767
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:scala}
> spark.range(1000).write.bucketBy(432, "id").saveAsTable("t1")
> spark.range(1000).write.bucketBy(34, "id").saveAsTable("t2")
> sql("set spark.sql.shuffle.partitions=600")
> sql("set spark.sql.autoBroadcastJoinThreshold=-1")
> sql("select * from t1 join t2 on t1.id = t2.id").explain()
> {code}
> {noformat}
> == Physical Plan ==
> *(5) SortMergeJoin [id#26L], [id#27L], Inner
> :- *(2) Sort [id#26L ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(id#26L, 600), true, [id=#65]
> : +- *(1) Filter isnotnull(id#26L)
> :+- *(1) ColumnarToRow
> :   +- FileScan parquet default.t1[id#26L] Batched: true, 
> DataFilters: [isnotnull(id#26L)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/spark/SPARK-32444/sql/core/spark-warehouse/org.apache.spark...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: 
> struct, SelectedBucketsCount: 432 out of 432
> +- *(4) Sort [id#27L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(id#27L, 600), true, [id=#74]
>   +- *(3) Filter isnotnull(id#27L)
>  +- *(3) ColumnarToRow
> +- FileScan parquet default.t2[id#27L] Batched: true, 
> DataFilters: [isnotnull(id#27L)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/spark/SPARK-32444/sql/core/spark-warehouse/org.apache.spark...,
>  

[jira] [Assigned] (SPARK-32774) Don't track docs/.jekyll-cache

2020-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32774:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Don't track docs/.jekyll-cache
> --
>
> Key: SPARK-32774
> URL: https://issues.apache.org/jira/browse/SPARK-32774
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> I noticed that docs/.jekyll-cache can sometimes be created, and it should not be 
> tracked.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32774) Don't track docs/.jekyll-cache

2020-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32774:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Don't track docs/.jekyll-cache
> --
>
> Key: SPARK-32774
> URL: https://issues.apache.org/jira/browse/SPARK-32774
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> I noticed that docs/.jekyll-cache can sometimes be created, and it should not be 
> tracked.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32774) Don't track docs/.jekyll-cache

2020-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188827#comment-17188827
 ] 

Apache Spark commented on SPARK-32774:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/29622

> Don't track docs/.jekyll-cache
> --
>
> Key: SPARK-32774
> URL: https://issues.apache.org/jira/browse/SPARK-32774
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> I noticed that docs/.jekyll-cache can sometimes be created, and it should not be 
> tracked.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32774) Don't track docs/.jekyll-cache

2020-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188828#comment-17188828
 ] 

Apache Spark commented on SPARK-32774:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/29622

> Don't track docs/.jekyll-cache
> --
>
> Key: SPARK-32774
> URL: https://issues.apache.org/jira/browse/SPARK-32774
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> I noticed that docs/.jekyll-cache can sometimes be created, and it should not be 
> tracked.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32774) Don't track docs/.jekyll-cache

2020-09-01 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-32774:
--

 Summary: Don't track docs/.jekyll-cache
 Key: SPARK-32774
 URL: https://issues.apache.org/jira/browse/SPARK-32774
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.1.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


I noticed that docs/.jekyll-cache can sometimes be created, and it should not be 
tracked.
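Presumably the fix is just to ignore the directory in git; a one-line addition along these lines (the exact .gitignore location is an assumption):
{noformat}
# Jekyll build cache created when generating the docs; should never be committed
docs/.jekyll-cache
{noformat}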



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars

2020-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188826#comment-17188826
 ] 

Apache Spark commented on SPARK-32119:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/29621

> ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars
> --
>
> Key: SPARK-32119
> URL: https://issues.apache.org/jira/browse/SPARK-32119
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.1.0
>
>
> ExecutorPlugin can't work with Standalone Cluster and Kubernetes
> when a jar that contains the plugins, and files used by the plugins, are added 
> with the --jars and --files options of spark-submit.
> This is because jars and files added by --jars and --files are not yet loaded at 
> Executor initialization.
> I confirmed it works with YARN because jars/files are distributed there via the 
> distributed cache.
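A minimal sketch of the failing setup, assuming the org.apache.spark.api.plugin API and hypothetical class/file names: the plugin jar is added with --jars, a side file with --files, and the plugin is enabled via spark.plugins. On YARN both are distributed before executors start; on Standalone/Kubernetes they are not available at plugin-initialization time.
{code:scala}
import java.util.{Map => JMap}
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}

class MyExecutorPlugin extends ExecutorPlugin {
  override def init(ctx: PluginContext, extraConf: JMap[String, String]): Unit = {
    // e.g. read a configuration file shipped with --files
  }
}

class MySparkPlugin extends SparkPlugin {
  override def driverPlugin(): DriverPlugin = null          // no driver-side component in this sketch
  override def executorPlugin(): ExecutorPlugin = new MyExecutorPlugin
}
{code}
Submitted, hypothetically, with something like: spark-submit --conf spark.plugins=MySparkPlugin --jars my-plugin.jar --files plugin.conf ... against a Standalone or Kubernetes master.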



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars

2020-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188825#comment-17188825
 ] 

Apache Spark commented on SPARK-32119:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/29621

> ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars
> --
>
> Key: SPARK-32119
> URL: https://issues.apache.org/jira/browse/SPARK-32119
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.1.0
>
>
> ExecutorPlugin can't work with Standalone Cluster and Kubernetes
> when a jar that contains the plugins, and files used by the plugins, are added 
> with the --jars and --files options of spark-submit.
> This is because jars and files added by --jars and --files are not yet loaded at 
> Executor initialization.
> I confirmed it works with YARN because jars/files are distributed there via the 
> distributed cache.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32530) SPIP: Kotlin support for Apache Spark

2020-09-01 Thread Dennis Jaheruddin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188816#comment-17188816
 ] 

Dennis Jaheruddin commented on SPARK-32530:
---

Please note that a significant number of on-prem users are running Spark on HDP 
2 and HDP 3, so they will be on 2.3.2; CDH 6 is on 2.4.0. (I won't even mention 
what version CDH 5 users are on.) In both cases a successor has been announced, 
but most Hadoop users will be on lower versions than you plan to support for 
the next 12-18 months.

Keeping this in mind may help if you want to tap into a larger audience.

> SPIP: Kotlin support for Apache Spark
> -
>
> Key: SPARK-32530
> URL: https://issues.apache.org/jira/browse/SPARK-32530
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Pasha Finkeshteyn
>Priority: Major
>
> h2. Background and motivation
> Kotlin is a cross-platform, statically typed, general-purpose JVM language. 
> In the last year more than 5 million developers have used Kotlin in mobile, 
> backend, frontend and scientific development. The number of Kotlin developers 
> grows rapidly every year. 
>  * [According to 
> redmonk|https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/]: 
> "Kotlin, the second fastest growing language we’ve seen outside of Swift, 
> made a big splash a year ago at this time when it vaulted eight full spots up 
> the list."
>  * [According to snyk.io|https://snyk.io/wp-content/uploads/jvm_2020.pdf], 
> Kotlin is the second most popular language on the JVM
>  * [According to 
> StackOverflow|https://insights.stackoverflow.com/survey/2020] Kotlin’s share 
> increased by 7.8% in 2020.
> We notice the increasing usage of Kotlin in data analysis ([6% of users in 
> 2020|https://www.jetbrains.com/lp/devecosystem-2020/kotlin/], as opposed to 
> 2% in 2019) and machine learning (3% of users in 2020, as opposed to 0% in 
> 2019), and we expect these numbers to continue to grow. 
> We, authors of this SPIP, strongly believe that making Kotlin API officially 
> available to developers can bring new users to Apache Spark and help some of 
> the existing users.
> h2. Goals
> The goal of this project is to bring first-class support for Kotlin language 
> into the Apache Spark project. We’re going to achieve this by adding one more 
> module to the current Apache Spark distribution.
> h2. Non-goals
> There is no goal to replace any existing language support or to change any 
> existing Apache Spark API.
> At this time, there is no goal to support non-core APIs of Apache Spark like 
> Spark ML and Spark structured streaming. This may change in the future based 
> on community feedback.
> There is no goal to provide CLI for Kotlin for Apache Spark, this will be a 
> separate SPIP.
> There is no goal to provide support for Apache Spark < 3.0.0.
> h2. Current implementation
> A working prototype is available at 
> [https://github.com/JetBrains/kotlin-spark-api]. It has been tested inside 
> JetBrains and by early adopters.
> h2. What are the risks?
> There is always a risk that this product won’t get enough popularity and will 
> bring more costs than benefits. It can be mitigated by the fact that we don't 
> need to change any existing API and support can be potentially dropped at any 
> time.
> We also believe that existing API is rather low maintenance. It does not 
> bring anything more complex than already exists in the Spark codebase. 
> Furthermore, the implementation is compact - less than 2000 lines of code.
> We are committed to maintaining, improving and evolving the API based on 
> feedback from both Spark and Kotlin communities. As the Kotlin data community 
> continues to grow, we see Kotlin API for Apache Spark as an important part in 
> the evolving Kotlin ecosystem, and intend to fully support it. 
> h2. How long will it take?
> A working implementation is already available, and if the community has any 
> proposals for changes to improve this implementation, they can be implemented 
> quickly, in weeks if not days.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars

2020-09-01 Thread Kousuke Saruta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188808#comment-17188808
 ] 

Kousuke Saruta commented on SPARK-32119:


Yeah, it's a bug fix so we may have a chance to backport this fix to 3.0.1.
I'll make a backport PR.

> ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars
> --
>
> Key: SPARK-32119
> URL: https://issues.apache.org/jira/browse/SPARK-32119
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.1.0
>
>
> ExecutorPlugin can't work with Standalone Cluster and Kubernetes
> when a jar that contains the plugins, and files used by the plugins, are added 
> with the --jars and --files options of spark-submit.
> This is because jars and files added by --jars and --files are not yet loaded at 
> Executor initialization.
> I confirmed it works with YARN because jars/files are distributed there via the 
> distributed cache.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32773) The behavior of listJars and listFiles is not consistent between YARN and other cluster managers

2020-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188801#comment-17188801
 ] 

Apache Spark commented on SPARK-32773:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/29620

> The behavior of listJars and listFiles is not consistent between YARN and 
> other cluster managers
> 
>
> Key: SPARK-32773
> URL: https://issues.apache.org/jira/browse/SPARK-32773
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> Jars/files specified with --jars / --files options are listed by sc.listJars 
> and listFiles except when we run apps on YARN.
> If we run apps not on YARN, those files are served by the embedded file 
> server in the driver and listJars/listFiles list the served files.
> But with YARN, such files specified by the options are not served by the 
> embedded file server so listJars and listFiles don't list them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32773) The behavior of listJars and listFiles is not consistent between YARN and other cluster managers

2020-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188800#comment-17188800
 ] 

Apache Spark commented on SPARK-32773:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/29620

> The behavior of listJars and listFiles is not consistent between YARN and 
> other cluster managers
> 
>
> Key: SPARK-32773
> URL: https://issues.apache.org/jira/browse/SPARK-32773
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> Jars/files specified with --jars / --files options are listed by sc.listJars 
> and listFiles except when we run apps on YARN.
> If we run apps not on YARN, those files are served by the embedded file 
> server in the driver and listJars/listFiles list the served files.
> But with YARN, such files specified by the options are not served by the 
> embedded file server so listJars and listFiles don't list them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32773) The behavior of listJars and listFiles is not consistent between YARN and other cluster managers

2020-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32773:


Assignee: Apache Spark  (was: Kousuke Saruta)

> The behavior of listJars and listFiles is not consistent between YARN and 
> other cluster managers
> 
>
> Key: SPARK-32773
> URL: https://issues.apache.org/jira/browse/SPARK-32773
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Major
>
> Jars/files specified with --jars / --files options are listed by sc.listJars 
> and listFiles except when we run apps on YARN.
> If we run apps not on YARN, those files are served by the embedded file 
> server in the driver and listJars/listFiles list the served files.
> But with YARN, such files specified by the options are not served by the 
> embedded file server so listJars and listFiles don't list them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32773) The behavior of listJars and listFiles is not consistent between YARN and other cluster managers

2020-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32773:


Assignee: Kousuke Saruta  (was: Apache Spark)

> The behavior of listJars and listFiles is not consistent between YARN and 
> other cluster managers
> 
>
> Key: SPARK-32773
> URL: https://issues.apache.org/jira/browse/SPARK-32773
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> Jars/files specified with --jars / --files options are listed by sc.listJars 
> and listFiles except when we run apps on YARN.
> If we run apps not on YARN, those files are served by the embedded file 
> server in the driver and listJars/listFiles list the served files.
> But with YARN, such files specified by the options are not served by the 
> embedded file server so listJars and listFiles don't list them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32773) The behavior of listJars and listFiles is not consistent between YARN and other cluster managers

2020-09-01 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-32773:
---
Description: 
Jars/files specified with --jars / --files options are listed by sc.listJars 
and listFiles except when we run apps on YARN.
If we run apps not on YARN, those files are served by the embedded file server 
in the driver and listJars/listFiles list the served files.
But with YARN, such files specified by the options are not served by the 
embedded file server so listJars and listFiles don't list them.
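For illustration, a small sketch of how the difference shows up from user code (assuming a spark-shell style SparkSession named spark; paths are placeholders):
{code:scala}
// After e.g. spark-submit --jars /path/extra.jar --files /path/app.conf ...
// On Standalone/Kubernetes/local these URIs are served by the driver's file server
// and appear below; on YARN they go through the YARN distributed cache and are missing.
val sc = spark.sparkContext
sc.listJars().foreach(println)   // jars served by the embedded file server
sc.listFiles().foreach(println)  // files served by the embedded file server
{code}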

  was:
Jars/files specified with --jars / --files options are listed by sc.listJars 
and listFiles except when we run apps on YARN.

If we run apps not on YARN, those files are served by the embedded file server 
in the driver and listJars/listFiles list the served files.

But with YARN, such files specified by the options are not served by the 
embedded file server so listJars and listFiles don't list them.


> The behavior of listJars and listFiles is not consistent between YARN and 
> other cluster managers
> 
>
> Key: SPARK-32773
> URL: https://issues.apache.org/jira/browse/SPARK-32773
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> Jars/files specified with --jars / --files options are listed by sc.listJars 
> and listFiles except when we run apps on YARN.
> If we run apps not on YARN, those files are served by the embedded file 
> server in the driver and listJars/listFiles list the served files.
> But with YARN, such files specified by the options are not served by the 
> embedded file server so listJars and listFiles don't list them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32773) The behavior of listJars and listFiles is not consistent between YARN and other cluster managers

2020-09-01 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-32773:
---
Description: 
Jars/files specified with --jars / --files options are listed by sc.listJars 
and listFiles except when we run apps on YARN.

If we run apps not on YARN, those files are served by the embedded file server 
in the driver and listJars/listFiles list the served files.

But with YARN, such files specified by the options are not served by the 
embedded file server so listJars and listFiles don't list them.

  was:
Jars/files specified with --jars/--files options are listed by sc.listJars and 
listFiles except when we run apps on YARN.

If we run apps not on YARN, those files are served by the embedded file server 
in the driver and listJars/listFiles list the served files.

But with YARN, such files specified by the options are not served by the 
embedded file server so listJars and listFiles don't list them.


> The behavior of listJars and listFiles is not consistent between YARN and 
> other cluster managers
> 
>
> Key: SPARK-32773
> URL: https://issues.apache.org/jira/browse/SPARK-32773
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> Jars/files specified with --jars / --files options are listed by sc.listJars 
> and listFiles except when we run apps on YARN.
> If we run apps not on YARN, those files are served by the embedded file 
> server in the driver and listJars/listFiles list the served files.
> But with YARN, such files specified by the options are not served by the 
> embedded file server so listJars and listFiles don't list them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32773) The behavior of listJars and listFiles is not consistent between YARN and other cluster managers

2020-09-01 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-32773:
---
Summary: The behavior of listJars and listFiles is not consistent between 
YARN and other cluster managers  (was: The behavior of listJars and listFiles 
is not consistent with YARN and other cluster managers)

> The behavior of listJars and listFiles is not consistent between YARN and 
> other cluster managers
> 
>
> Key: SPARK-32773
> URL: https://issues.apache.org/jira/browse/SPARK-32773
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> Jars/files specified with --jars/--files options are listed by sc.listJars 
> and listFiles except when we run apps on YARN.
> If we run apps not on YARN, those files are served by the embedded file 
> server in the driver and listJars/listFiles list the served files.
> But with YARN, such files specified by the options are not served by the 
> embedded file server so listJars and listFiles don't list them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32773) The behavior of listJars and listFiles is not consistent with YARN and other cluster managers

2020-09-01 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-32773:
--

 Summary: The behavior of listJars and listFiles is not consistent 
with YARN and other cluster managers
 Key: SPARK-32773
 URL: https://issues.apache.org/jira/browse/SPARK-32773
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 3.1.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


Jars/files specified with --jars/--files options are listed by sc.listJars and 
listFiles except when we run apps on YARN.

If we run apps not on YARN, those files are served by the embedded file server 
in the driver and listJars/listFiles list the served files.

But with YARN, such files specified by the options are not served by the 
embedded file server so listJars and listFiles don't list them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32772) Reduce log messages for spark-sql CLI

2020-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188782#comment-17188782
 ] 

Apache Spark commented on SPARK-32772:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/29619

> Reduce log messages for spark-sql CLI
> -
>
> Key: SPARK-32772
> URL: https://issues.apache.org/jira/browse/SPARK-32772
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> When we launch the spark-sql CLI, too many log messages are shown and it's 
> sometimes difficult to find the result of a query.
> So I think it's better to reduce log messages, as the spark-shell and pyspark 
> CLIs do.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32772) Reduce log messages for spark-sql CLI

2020-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32772:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Reduce log messages for spark-sql CLI
> -
>
> Key: SPARK-32772
> URL: https://issues.apache.org/jira/browse/SPARK-32772
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> When we launch the spark-sql CLI, too many log messages are shown and it's 
> sometimes difficult to find the result of a query.
> So I think it's better to reduce log messages, as the spark-shell and pyspark 
> CLIs do.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32772) Reduce log messages for spark-sql CLI

2020-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188781#comment-17188781
 ] 

Apache Spark commented on SPARK-32772:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/29619

> Reduce log messages for spark-sql CLI
> -
>
> Key: SPARK-32772
> URL: https://issues.apache.org/jira/browse/SPARK-32772
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> When we launch the spark-sql CLI, too many log messages are shown and it's 
> sometimes difficult to find the result of a query.
> So I think it's better to reduce log messages, as the spark-shell and pyspark 
> CLIs do.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32772) Reduce log messages for spark-sql CLI

2020-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32772:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Reduce log messages for spark-sql CLI
> -
>
> Key: SPARK-32772
> URL: https://issues.apache.org/jira/browse/SPARK-32772
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> When we launch the spark-sql CLI, too many log messages are shown and it's 
> sometimes difficult to find the result of a query.
> So I think it's better to reduce log messages, as the spark-shell and pyspark 
> CLIs do.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32772) Reduce log messages for spark-sql CLI

2020-09-01 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-32772:
---
Description: 
When we launch the spark-sql CLI, too many log messages are shown and it's 
sometimes difficult to find the result of a query.

So I think it's better to reduce log messages, as the spark-shell and pyspark CLIs do.
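Until then, the usual way to quiet the CLI by hand is a conf/log4j.properties override (a sketch, assuming the log4j 1.x template shipped with Spark):
{noformat}
# conf/log4j.properties (start from conf/log4j.properties.template)
# Lower the root logger from INFO to WARN so query results are easier to spot
log4j.rootCategory=WARN, console
{noformat}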

  was:
When we launch spark-sql CLI, too many log messages are shown and it's 
sometimes difficult to find the result of query.

So I think it's better to suppress log like spark-shell and pyspark CLI.


> Reduce log messages for spark-sql CLI
> -
>
> Key: SPARK-32772
> URL: https://issues.apache.org/jira/browse/SPARK-32772
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> When we launch the spark-sql CLI, too many log messages are shown and it's 
> sometimes difficult to find the result of a query.
> So I think it's better to reduce log messages, as the spark-shell and pyspark 
> CLIs do.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32772) Reduce log messages for spark-sql CLI

2020-09-01 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-32772:
---
Summary: Reduce log messages for spark-sql CLI  (was: Suppress log for 
spark-sql CLI)

> Reduce log messages for spark-sql CLI
> -
>
> Key: SPARK-32772
> URL: https://issues.apache.org/jira/browse/SPARK-32772
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> When we launch the spark-sql CLI, too many log messages are shown and it's 
> sometimes difficult to find the result of a query.
> So I think it's better to suppress the log output, as the spark-shell and pyspark CLIs do.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32772) Suppress log for spark-sql CLI

2020-09-01 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-32772:
--

 Summary: Suppress log for spark-sql CLI
 Key: SPARK-32772
 URL: https://issues.apache.org/jira/browse/SPARK-32772
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


When we launch the spark-sql CLI, too many log messages are shown and it's 
sometimes difficult to find the result of a query.

So I think it's better to suppress the log output, as the spark-shell and pyspark CLIs do.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19320) Allow guaranteed amount of GPU to be used when launching jobs

2020-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188779#comment-17188779
 ] 

Apache Spark commented on SPARK-19320:
--

User 'farhan5900' has created a pull request for this issue:
https://github.com/apache/spark/pull/29618

> Allow guaranteed amount of GPU to be used when launching jobs
> -
>
> Key: SPARK-19320
> URL: https://issues.apache.org/jira/browse/SPARK-19320
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>Priority: Major
>
> Currently the only configuration for using GPUs with Mesos sets the 
> maximum number of GPUs a job will take from an offer, but it doesn't guarantee 
> exactly how many.
> We should have a configuration that sets a guaranteed amount.
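For reference, a sketch of what exists today versus what is being asked for; the guaranteed-minimum setting below is hypothetical, not an existing config:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("mesos://zk://host:2181/mesos")   // placeholder Mesos master URL
  .config("spark.mesos.gpus.max", "4")      // existing: upper bound on GPUs taken from offers
  // .config("spark.mesos.gpus.min", "2")   // hypothetical: the guaranteed amount this issue proposes
  .getOrCreate()
{code}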



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19320) Allow guaranteed amount of GPU to be used when launching jobs

2020-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188778#comment-17188778
 ] 

Apache Spark commented on SPARK-19320:
--

User 'farhan5900' has created a pull request for this issue:
https://github.com/apache/spark/pull/29618

> Allow guaranteed amount of GPU to be used when launching jobs
> -
>
> Key: SPARK-19320
> URL: https://issues.apache.org/jira/browse/SPARK-19320
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>Priority: Major
>
> Currently the only configuration for using GPUs with Mesos sets the 
> maximum number of GPUs a job will take from an offer, but it doesn't guarantee 
> exactly how many.
> We should have a configuration that sets a guaranteed amount.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32771) The example of expressions.Aggregator in Javadoc / Scaladoc is wrong

2020-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188777#comment-17188777
 ] 

Apache Spark commented on SPARK-32771:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/29617

> The example of expressions.Aggregator in Javadoc / Scaladoc is wrong
> 
>
> Key: SPARK-32771
> URL: https://issues.apache.org/jira/browse/SPARK-32771
> Project: Spark
>  Issue Type: Bug
>  Components: docs
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> There is an example of expressions.Aggregator in the Javadoc and Scaladoc, as follows.
> {code:java}
> val customSummer =  new Aggregator[Data, Int, Int] {
>   def zero: Int = 0
>   def reduce(b: Int, a: Data): Int = b + a.i
>   def merge(b1: Int, b2: Int): Int = b1 + b2
>   def finish(r: Int): Int = r
> }.toColumn(){code}
> But this example doesn't work because it doesn't define bufferEncoder and 
> outputEncoder.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32771) The example of expressions.Aggregator in Javadoc / Scaladoc is wrong

2020-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188776#comment-17188776
 ] 

Apache Spark commented on SPARK-32771:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/29617

> The example of expressions.Aggregator in Javadoc / Scaladoc is wrong
> 
>
> Key: SPARK-32771
> URL: https://issues.apache.org/jira/browse/SPARK-32771
> Project: Spark
>  Issue Type: Bug
>  Components: docs
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> There is an example of expressions.Aggregator in the Javadoc and Scaladoc, as follows.
> {code:java}
> val customSummer =  new Aggregator[Data, Int, Int] {
>   def zero: Int = 0
>   def reduce(b: Int, a: Data): Int = b + a.i
>   def merge(b1: Int, b2: Int): Int = b1 + b2
>   def finish(r: Int): Int = r
> }.toColumn(){code}
> But this example doesn't work because it doesn't define bufferEncoder and 
> outputEncoder.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32771) The example of expressions.Aggregator in Javadoc / Scaladoc is wrong

2020-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32771:


Assignee: Apache Spark  (was: Kousuke Saruta)

> The example of expressions.Aggregator in Javadoc / Scaladoc is wrong
> 
>
> Key: SPARK-32771
> URL: https://issues.apache.org/jira/browse/SPARK-32771
> Project: Spark
>  Issue Type: Bug
>  Components: docs
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> There is an example of expressions.Aggregator in the Javadoc and Scaladoc, as follows.
> {code:java}
> val customSummer =  new Aggregator[Data, Int, Int] {
>   def zero: Int = 0
>   def reduce(b: Int, a: Data): Int = b + a.i
>   def merge(b1: Int, b2: Int): Int = b1 + b2
>   def finish(r: Int): Int = r
> }.toColumn(){code}
> But this example doesn't work because it doesn't define bufferEncoder and 
> outputEncoder.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32771) The example of expressions.Aggregator in Javadoc / Scaladoc is wrong

2020-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32771:


Assignee: Kousuke Saruta  (was: Apache Spark)

> The example of expressions.Aggregator in Javadoc / Scaladoc is wrong
> 
>
> Key: SPARK-32771
> URL: https://issues.apache.org/jira/browse/SPARK-32771
> Project: Spark
>  Issue Type: Bug
>  Components: docs
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> There is an example of expressions.Aggregator in the Javadoc and Scaladoc, as follows.
> {code:java}
> val customSummer =  new Aggregator[Data, Int, Int] {
>   def zero: Int = 0
>   def reduce(b: Int, a: Data): Int = b + a.i
>   def merge(b1: Int, b2: Int): Int = b1 + b2
>   def finish(r: Int): Int = r
> }.toColumn(){code}
> But this example doesn't work because it doesn't define bufferEncoder and 
> outputEncoder.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32771) The example of expressions.Aggregator in Javadoc / Scaladoc is wrong

2020-09-01 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-32771:
--

 Summary: The example of expressions.Aggregator in Javadoc / 
Scaladoc is wrong
 Key: SPARK-32771
 URL: https://issues.apache.org/jira/browse/SPARK-32771
 Project: Spark
  Issue Type: Bug
  Components: docs
Affects Versions: 3.0.0, 3.1.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


There is an example of expressions.Aggregator in the Javadoc and Scaladoc, as follows.
{code:java}
val customSummer =  new Aggregator[Data, Int, Int] {
  def zero: Int = 0
  def reduce(b: Int, a: Data): Int = b + a.i
  def merge(b1: Int, b2: Int): Int = b1 + b2
  def finish(r: Int): Int = r
}.toColumn(){code}
But this example doesn't work because it doesn't define bufferEncoder and 
outputEncoder.
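A corrected version would look roughly like this (a sketch of the missing pieces, not necessarily the exact wording the docs will end up with):
{code:scala}
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class Data(i: Int)

val customSummer = new Aggregator[Data, Int, Int] {
  def zero: Int = 0
  def reduce(b: Int, a: Data): Int = b + a.i
  def merge(b1: Int, b2: Int): Int = b1 + b2
  def finish(r: Int): Int = r
  // The two members the current example omits:
  def bufferEncoder: Encoder[Int] = Encoders.scalaInt
  def outputEncoder: Encoder[Int] = Encoders.scalaInt
}.toColumn
{code}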



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32770) Add missing imports

2020-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188764#comment-17188764
 ] 

Apache Spark commented on SPARK-32770:
--

User 'Fokko' has created a pull request for this issue:
https://github.com/apache/spark/pull/29616

> Add missing imports
> ---
>
> Key: SPARK-32770
> URL: https://issues.apache.org/jira/browse/SPARK-32770
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Fokko Driesprong
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32770) Add missing imports

2020-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32770:


Assignee: (was: Apache Spark)

> Add missing imports
> ---
>
> Key: SPARK-32770
> URL: https://issues.apache.org/jira/browse/SPARK-32770
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Fokko Driesprong
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32770) Add missing imports

2020-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188763#comment-17188763
 ] 

Apache Spark commented on SPARK-32770:
--

User 'Fokko' has created a pull request for this issue:
https://github.com/apache/spark/pull/29615

> Add missing imports
> ---
>
> Key: SPARK-32770
> URL: https://issues.apache.org/jira/browse/SPARK-32770
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Fokko Driesprong
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32770) Add missing imports

2020-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32770:


Assignee: Apache Spark

> Add missing imports
> ---
>
> Key: SPARK-32770
> URL: https://issues.apache.org/jira/browse/SPARK-32770
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Fokko Driesprong
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32770) Add missing imports

2020-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188762#comment-17188762
 ] 

Apache Spark commented on SPARK-32770:
--

User 'Fokko' has created a pull request for this issue:
https://github.com/apache/spark/pull/29615

> Add missing imports
> ---
>
> Key: SPARK-32770
> URL: https://issues.apache.org/jira/browse/SPARK-32770
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Fokko Driesprong
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32770) Add missing imports

2020-09-01 Thread Fokko Driesprong (Jira)
Fokko Driesprong created SPARK-32770:


 Summary: Add missing imports
 Key: SPARK-32770
 URL: https://issues.apache.org/jira/browse/SPARK-32770
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.0, 2.4.6
Reporter: Fokko Driesprong






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32769) setting spark.network.timeout no longer sets spark.storage.blockManagerSlaveTimeoutMs

2020-09-01 Thread Tim Osborne (Jira)
Tim Osborne created SPARK-32769:
---

 Summary: setting spark.network.timeout no longer sets 
spark.storage.blockManagerSlaveTimeoutMs
 Key: SPARK-32769
 URL: https://issues.apache.org/jira/browse/SPARK-32769
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Tim Osborne


Setting spark.network.timeout no longer sets 
spark.storage.blockManagerSlaveTimeoutMs

 

Reproducible by setting spark.network.timeout=3600s and 
spark.executor.heartbeatInterval=1800s at startup. The workaround is to also set 
spark.storage.blockManagerSlaveTimeoutMs=3600s explicitly at startup.
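The workaround in code form, a sketch using the values from the report above (builder-based configuration assumed):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.network.timeout", "3600s")
  .config("spark.executor.heartbeatInterval", "1800s")
  // With this regression the derived timeout is no longer applied, so set it explicitly:
  .config("spark.storage.blockManagerSlaveTimeoutMs", "3600s")
  .getOrCreate()
{code}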



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32765) EliminateJoinToEmptyRelation should respect exchange behavior when canChangeNumPartitions == false

2020-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188709#comment-17188709
 ] 

Apache Spark commented on SPARK-32765:
--

User 'leanken' has created a pull request for this issue:
https://github.com/apache/spark/pull/29614

> EliminateJoinToEmptyRelation should respect exchange behavior when 
> canChangeNumPartitions == false
> --
>
> Key: SPARK-32765
> URL: https://issues.apache.org/jira/browse/SPARK-32765
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Major
>
> Currently, the EliminateJoinToEmptyRelation rule will convert a Join into an 
> EmptyRelation in some cases with AQE on. But if either sub-plan of the Join 
> contains a ShuffleQueryStage(canChangeNumPartitions == false), meaning the 
> Exchange was produced by a repartition or requires a single partition, then 
> converting it into an EmptyRelation would lose the user-specified number of 
> partitions for downstream operators, which is not right.
> So in the patch, we do not do the conversion if either sub-plan of the Join 
> contains a ShuffleQueryStage(canChangeNumPartitions == false).
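A minimal sketch, not taken from the issue, of the kind of plan being discussed: the user pins the partitioning with repartition, one join side turns out to be empty at runtime, and replacing the join with an empty relation would drop the user-specified partition count for downstream operators.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]")
  .config("spark.sql.adaptive.enabled", "true")
  .getOrCreate()
import spark.implicits._

val t1 = spark.range(100).repartition(5, $"id")   // user-specified number of partitions
val t2 = spark.range(100).filter($"id" < 0)       // empty at runtime
val joined = t1.join(t2, Seq("id"), "inner")      // AQE may rewrite this join to an empty relation
println(joined.rdd.getNumPartitions)              // downstream code may rely on the 5 partitions
{code}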



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32765) EliminateJoinToEmptyRelation should respect exchange behavior when canChangeNumPartitions == false

2020-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32765:


Assignee: Apache Spark

> EliminateJoinToEmptyRelation should respect exchange behavior when 
> canChangeNumPartitions == false
> --
>
> Key: SPARK-32765
> URL: https://issues.apache.org/jira/browse/SPARK-32765
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Assignee: Apache Spark
>Priority: Major
>
> Currently, the EliminateJoinToEmptyRelation rule will convert a Join into an 
> EmptyRelation in some cases with AQE on. But if either sub-plan of the Join 
> contains a ShuffleQueryStage(canChangeNumPartitions == false), meaning the 
> Exchange was produced by a repartition or requires a single partition, then 
> converting it into an EmptyRelation would lose the user-specified number of 
> partitions for downstream operators, which is not right.
> So in the patch, we do not do the conversion if either sub-plan of the Join 
> contains a ShuffleQueryStage(canChangeNumPartitions == false).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32765) EliminateJoinToEmptyRelation should respect exchange behavior when canChangeNumPartitions == false

2020-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188706#comment-17188706
 ] 

Apache Spark commented on SPARK-32765:
--

User 'leanken' has created a pull request for this issue:
https://github.com/apache/spark/pull/29614

> EliminateJoinToEmptyRelation should respect exchange behavior when 
> canChangeNumPartitions == false
> --
>
> Key: SPARK-32765
> URL: https://issues.apache.org/jira/browse/SPARK-32765
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Major
>
> Currently, the EliminateJoinToEmptyRelation rule will convert a Join into an 
> EmptyRelation in some cases with AQE on. But if either sub-plan of the Join 
> contains a ShuffleQueryStage(canChangeNumPartitions == false), meaning the 
> Exchange was produced by a repartition or requires a single partition, then 
> converting it into an EmptyRelation would lose the user-specified number of 
> partitions for downstream operators, which is not right.
> So in the patch, we do not do the conversion if either sub-plan of the Join 
> contains a ShuffleQueryStage(canChangeNumPartitions == false).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32765) EliminateJoinToEmptyRelation should respect exchange behavior when canChangeNumPartitions == false

2020-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32765:


Assignee: (was: Apache Spark)

> EliminateJoinToEmptyRelation should respect exchange behavior when 
> canChangeNumPartitions == false
> --
>
> Key: SPARK-32765
> URL: https://issues.apache.org/jira/browse/SPARK-32765
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Major
>
> Currently, the EliminateJoinToEmptyRelation rule converts a Join into an
> EmptyRelation in some cases when AQE is on. But if either sub-plan of the Join
> contains a ShuffleQueryStage(canChangeNumPartitions == false), which means
> the Exchange was produced by a repartition or by singlePartition, converting
> the Join into an EmptyRelation would lose the user-specified number of
> partitions for downstream operators, which is not correct.
> So this patch skips the conversion if either sub-plan of the Join contains a
> ShuffleQueryStage(canChangeNumPartitions == false)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32765) EliminateJoinToEmptyRelation should respect exchange behavior when canChangeNumPartitions == false

2020-09-01 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-32765:

Environment: (was: Currently, the EliminateJoinToEmptyRelation rule converts a
Join into an EmptyRelation in some cases when AQE is on. But if either sub-plan
of the Join contains a ShuffleQueryStage(canChangeNumPartitions == false), which
means the Exchange was produced by a repartition or by singlePartition,
converting the Join into an EmptyRelation would lose the user-specified number
of partitions for downstream operators, which is not correct.

So this patch skips the conversion if either sub-plan of the Join contains a
ShuffleQueryStage(canChangeNumPartitions == false))

> EliminateJoinToEmptyRelation should respect exchange behavior when 
> canChangeNumPartitions == false
> --
>
> Key: SPARK-32765
> URL: https://issues.apache.org/jira/browse/SPARK-32765
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Major
>
> Currently, the EliminateJoinToEmptyRelation rule converts a Join into an
> EmptyRelation in some cases when AQE is on. But if either sub-plan of the Join
> contains a ShuffleQueryStage(canChangeNumPartitions == false), which means
> the Exchange was produced by a repartition or by singlePartition, converting
> the Join into an EmptyRelation would lose the user-specified number of
> partitions for downstream operators, which is not correct.
> So this patch skips the conversion if either sub-plan of the Join contains a
> ShuffleQueryStage(canChangeNumPartitions == false)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32765) EliminateJoinToEmptyRelation should respect exchange behavior when canChangeNumPartitions == false

2020-09-01 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-32765:

Description: 
Currently, the EliminateJoinToEmptyRelation rule converts a Join into an
EmptyRelation in some cases when AQE is on. But if either sub-plan of the Join
contains a ShuffleQueryStage(canChangeNumPartitions == false), which means the
Exchange was produced by a repartition or by singlePartition, converting the
Join into an EmptyRelation would lose the user-specified number of partitions
for downstream operators, which is not correct.

So this patch skips the conversion if either sub-plan of the Join contains a
ShuffleQueryStage(canChangeNumPartitions == false)
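
For illustration only (this sketch is not part of the ticket or the patch), the
situation can be pictured in spark-shell with AQE enabled: one side of the join
carries a user-specified repartition, the other side turns out to be empty at
runtime.

{code:scala}
// Hypothetical sketch, assuming a spark-shell session with AQE enabled.
// repartition(5, ...) yields an Exchange whose partition count must be kept
// (canChangeNumPartitions == false, per the description above).
import org.apache.spark.sql.functions.col

spark.conf.set("spark.sql.adaptive.enabled", "true")

val left  = spark.range(100).repartition(5, col("id"))
val right = spark.range(100).filter(col("id") < 0)   // empty at runtime

// Rewriting this join to an EmptyRelation would drop the 5-partition layout
// that the user asked for on the left side.
val joined = left.join(right, "id")
joined.rdd.getNumPartitions
{code}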

> EliminateJoinToEmptyRelation should respect exchange behavior when 
> canChangeNumPartitions == false
> --
>
> Key: SPARK-32765
> URL: https://issues.apache.org/jira/browse/SPARK-32765
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Currently, the EliminateJoinToEmptyRelation rule converts a Join
> into an EmptyRelation in some cases when AQE is on. But if either sub-plan of
> the Join contains a ShuffleQueryStage(canChangeNumPartitions == false), which
> means the Exchange was produced by a repartition or by singlePartition,
> converting the Join into an EmptyRelation would lose the user-specified number
> of partitions for downstream operators, which is not correct.
> So this patch skips the conversion if either sub-plan of the Join contains a
> ShuffleQueryStage(canChangeNumPartitions == false)
>Reporter: Leanken.Lin
>Priority: Major
>
> Currently, the EliminateJoinToEmptyRelation rule converts a Join into an
> EmptyRelation in some cases when AQE is on. But if either sub-plan of the Join
> contains a ShuffleQueryStage(canChangeNumPartitions == false), which means
> the Exchange was produced by a repartition or by singlePartition, converting
> the Join into an EmptyRelation would lose the user-specified number of
> partitions for downstream operators, which is not correct.
> So this patch skips the conversion if either sub-plan of the Join contains a
> ShuffleQueryStage(canChangeNumPartitions == false)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32768) Add Parquet Timestamp output configuration to docs

2020-09-01 Thread Ron DeFreitas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ron DeFreitas updated SPARK-32768:
--
Description: 
{{Spark 2.3.0 added the spark.sql.parquet.outputTimestampType configuration 
option for controlling the underlying datatype used when writing Timestamp 
column types into parquet files. This option is helpful for compatibility with 
external systems that need to read the output from Spark.}}

{{This was never exposed in the documentation. Fix should be applied to docs 
for both the next 3.x release and 2.4.x release.}}

  was:
{{Spark 2.3.0 added the spark.sql.parquet.outputTimestampType configuration 
option for controlling the underlying datatype used when writing Timestamp 
column types into parquet files.}}

{{This was never exposed in the documentation. Fix should be applied to docs 
for both the next 3.x release and 2.4.x release.}}


> Add Parquet Timestamp output configuration to docs
> --
>
> Key: SPARK-32768
> URL: https://issues.apache.org/jira/browse/SPARK-32768
> Project: Spark
>  Issue Type: Documentation
>  Components: docs
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0
>Reporter: Ron DeFreitas
>Priority: Minor
>  Labels: docs-missing, parquet
>
> {{Spark 2.3.0 added the spark.sql.parquet.outputTimestampType configuration 
> option for controlling the underlying datatype used when writing Timestamp 
> column types into parquet files. This option is helpful for compatibility 
> with external systems that need to read the output from Spark.}}
> {{This was never exposed in the documentation. Fix should be applied to docs 
> for both the next 3.x release and 2.4.x release.}}
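
As a minimal sketch of the option being documented (the path and values below
are illustrative, not taken from the ticket):

{code:scala}
// Assumes a spark-shell session. Valid values for the option are
// INT96 (the default), TIMESTAMP_MICROS and TIMESTAMP_MILLIS.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

// Any timestamp column written after this point uses the chosen physical type.
spark.sql("select current_timestamp() as ts")
  .write.mode("overwrite").parquet("/tmp/ts_parquet")
{code}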



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32768) Add Parquet Timestamp output configuration to docs

2020-09-01 Thread Ron DeFreitas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ron DeFreitas updated SPARK-32768:
--
Description: 
{{Spark 2.3.0 added the spark.sql.parquet.outputTimestampType configuration 
option for controlling the underlying datatype used when writing Timestamp 
column types into parquet files.}}

{{This was never exposed in the documentation. Fix should be applied to docs 
for both the next 3.x release and 2.4.x release.}}

  was:
{{Spark 2.3.0 added the spark.sql.parquet.outputTimestampType configuration 
option for controlling the underlying datatype used when writing Timestamp 
column types into parquet files.}}

{{This was never exposed in the documentation. Fix should be applied to docs 
for both to the next 3.x release and 2.4.x release.}}


> Add Parquet Timestamp output configuration to docs
> --
>
> Key: SPARK-32768
> URL: https://issues.apache.org/jira/browse/SPARK-32768
> Project: Spark
>  Issue Type: Documentation
>  Components: docs
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0
>Reporter: Ron DeFreitas
>Priority: Minor
>  Labels: docs-missing, parquet
>
> {{Spark 2.3.0 added the spark.sql.parquet.outputTimestampType configuration 
> option for controlling the underlying datatype used when writing Timestamp 
> column types into parquet files.}}
> {{This was never exposed in the documentation. Fix should be applied to docs 
> for both the next 3.x release and 2.4.x release.}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32768) Add Parquet Timestamp output configuration to docs

2020-09-01 Thread Ron DeFreitas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ron DeFreitas updated SPARK-32768:
--
Description: 
{{Spark 2.3.0 added the spark.sql.parquet.outputTimestampType configuration 
option for controlling the underlying datatype used when writing Timestamp 
column types into parquet files.}}

{{This was never exposed in the documentation. Fix should be applied to docs 
for both the next 3.x release and 2.4.x release.}}

  was:
{{Spark 2.3.0 added the spark.sql.parquet.outputTimestampType configuration 
option for controlling the underlying datatype used when writing Timestamp 
column types into parquet files.}}

{{This was never exposed in the documentation. Fix should be applied to docs 
for both the next 3.x release and 2.4.x release.}}

{{PR for this fix pending.}}


> Add Parquet Timestamp output configuration to docs
> --
>
> Key: SPARK-32768
> URL: https://issues.apache.org/jira/browse/SPARK-32768
> Project: Spark
>  Issue Type: Documentation
>  Components: docs
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0
>Reporter: Ron DeFreitas
>Priority: Minor
>  Labels: docs-missing, parquet
>
> {{Spark 2.3.0 added the spark.sql.parquet.outputTimestampType configuration 
> option for controlling the underlying datatype used when writing Timestamp 
> column types into parquet files.}}
> {{This was never exposed in the documentation. Fix should be applied to docs 
> for both the next 3.x release and 2.4.x release.}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32768) Add Parquet Timestamp output configuration to docs

2020-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188646#comment-17188646
 ] 

Apache Spark commented on SPARK-32768:
--

User 'rdefreitas' has created a pull request for this issue:
https://github.com/apache/spark/pull/29613

> Add Parquet Timestamp output configuration to docs
> --
>
> Key: SPARK-32768
> URL: https://issues.apache.org/jira/browse/SPARK-32768
> Project: Spark
>  Issue Type: Documentation
>  Components: docs
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0
>Reporter: Ron DeFreitas
>Priority: Minor
>  Labels: docs-missing, parquet
>
> {{Spark 2.3.0 added the spark.sql.parquet.outputTimestampType configuration 
> option for controlling the underlying datatype used when writing Timestamp 
> column types into parquet files.}}
> {{This was never exposed in the documentation. Fix should be applied to docs 
> for both the next 3.x release and 2.4.x release.}}
> {{PR for this fix pending.}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32768) Add Parquet Timestamp output configuration to docs

2020-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32768:


Assignee: (was: Apache Spark)

> Add Parquet Timestamp output configuration to docs
> --
>
> Key: SPARK-32768
> URL: https://issues.apache.org/jira/browse/SPARK-32768
> Project: Spark
>  Issue Type: Documentation
>  Components: docs
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0
>Reporter: Ron DeFreitas
>Priority: Minor
>  Labels: docs-missing, parquet
>
> {{Spark 2.3.0 added the spark.sql.parquet.outputTimestampType configuration 
> option for controlling the underlying datatype used when writing Timestamp 
> column types into parquet files.}}
> {{This was never exposed in the documentation. Fix should be applied to docs 
> for both the next 3.x release and 2.4.x release.}}
> {{PR for this fix pending.}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32768) Add Parquet Timestamp output configuration to docs

2020-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32768:


Assignee: Apache Spark

> Add Parquet Timestamp output configuration to docs
> --
>
> Key: SPARK-32768
> URL: https://issues.apache.org/jira/browse/SPARK-32768
> Project: Spark
>  Issue Type: Documentation
>  Components: docs
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0
>Reporter: Ron DeFreitas
>Assignee: Apache Spark
>Priority: Minor
>  Labels: docs-missing, parquet
>
> {{Spark 2.3.0 added the spark.sql.parquet.outputTimestampType configuration 
> option for controlling the underlying datatype used when writing Timestamp 
> column types into parquet files.}}
> {{This was never exposed in the documentation. Fix should be applied to docs 
> for both the next 3.x release and 2.4.x release.}}
> {{PR for this fix pending.}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32767) Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

2020-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188642#comment-17188642
 ] 

Apache Spark commented on SPARK-32767:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/29612

> Bucket join should work if spark.sql.shuffle.partitions larger than bucket 
> number
> -
>
> Key: SPARK-32767
> URL: https://issues.apache.org/jira/browse/SPARK-32767
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:scala}
> spark.range(1000).write.bucketBy(500, "id").saveAsTable("t1")
> spark.range(1000).write.bucketBy(50, "id").saveAsTable("t2")
> sql("set spark.sql.shuffle.partitions=600")
> sql("set spark.sql.autoBroadcastJoinThreshold=-1")
> sql("select * from t1 join t2 on t1.id = t2.id").explain()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32767) Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

2020-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32767:


Assignee: Apache Spark

> Bucket join should work if spark.sql.shuffle.partitions larger than bucket 
> number
> -
>
> Key: SPARK-32767
> URL: https://issues.apache.org/jira/browse/SPARK-32767
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> How to reproduce this issue:
> {code:scala}
> spark.range(1000).write.bucketBy(500, "id").saveAsTable("t1")
> spark.range(1000).write.bucketBy(50, "id").saveAsTable("t2")
> sql("set spark.sql.shuffle.partitions=600")
> sql("set spark.sql.autoBroadcastJoinThreshold=-1")
> sql("select * from t1 join t2 on t1.id = t2.id").explain()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32767) Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

2020-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32767:


Assignee: (was: Apache Spark)

> Bucket join should work if spark.sql.shuffle.partitions larger than bucket 
> number
> -
>
> Key: SPARK-32767
> URL: https://issues.apache.org/jira/browse/SPARK-32767
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:scala}
> spark.range(1000).write.bucketBy(500, "id").saveAsTable("t1")
> spark.range(1000).write.bucketBy(50, "id").saveAsTable("t2")
> sql("set spark.sql.shuffle.partitions=600")
> sql("set spark.sql.autoBroadcastJoinThreshold=-1")
> sql("select * from t1 join t2 on t1.id = t2.id").explain()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32767) Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

2020-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188639#comment-17188639
 ] 

Apache Spark commented on SPARK-32767:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/29612

> Bucket join should work if spark.sql.shuffle.partitions larger than bucket 
> number
> -
>
> Key: SPARK-32767
> URL: https://issues.apache.org/jira/browse/SPARK-32767
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:scala}
> spark.range(1000).write.bucketBy(500, "id").saveAsTable("t1")
> spark.range(1000).write.bucketBy(50, "id").saveAsTable("t2")
> sql("set spark.sql.shuffle.partitions=600")
> sql("set spark.sql.autoBroadcastJoinThreshold=-1")
> sql("select * from t1 join t2 on t1.id = t2.id").explain()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32768) Add Parquet Timestamp output configuration to docs

2020-09-01 Thread Ron DeFreitas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ron DeFreitas updated SPARK-32768:
--
Target Version/s:   (was: 2.4.7, 2.4.8, 3.0.1, 3.1.0, 3.0.2)

> Add Parquet Timestamp output configuration to docs
> --
>
> Key: SPARK-32768
> URL: https://issues.apache.org/jira/browse/SPARK-32768
> Project: Spark
>  Issue Type: Documentation
>  Components: docs
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 
> 2.4.3, 2.4.4, 2.4.5, 2.4.6, 3.0.0
>Reporter: Ron DeFreitas
>Priority: Minor
>  Labels: docs-missing, parquet
>
> {{Spark 2.3.0 added the spark.sql.parquet.outputTimestampType configuration 
> option for controlling the underlying datatype used when writing Timestamp 
> column types into parquet files.}}
> {{This was never exposed in the documentation. Fix should be applied to docs 
> for both the next 3.x release and 2.4.x release.}}
> {{PR for this fix pending.}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32768) Add Parquet Timestamp output configuration to docs

2020-09-01 Thread Ron DeFreitas (Jira)
Ron DeFreitas created SPARK-32768:
-

 Summary: Add Parquet Timestamp output configuration to docs
 Key: SPARK-32768
 URL: https://issues.apache.org/jira/browse/SPARK-32768
 Project: Spark
  Issue Type: Documentation
  Components: docs
Affects Versions: 3.0.0, 2.4.6, 2.4.5, 2.4.4, 2.4.3, 2.4.2, 2.4.1, 2.4.0, 
2.3.4, 2.3.3, 2.3.2, 2.3.1, 2.3.0
Reporter: Ron DeFreitas


{{Spark 2.3.0 added the spark.sql.parquet.outputTimestampType configuration 
option for controlling the underlying datatype used when writing Timestamp 
column types into parquet files.}}

{{This was never exposed in the documentation. Fix should be applied to docs 
for both the next 3.x release and 2.4.x release.}}

{{PR for this fix pending.}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32767) Bucket join should work if SHUFFLE_PARTITIONS larger than bucket number

2020-09-01 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-32767:
---

 Summary: Bucket join should work if SHUFFLE_PARTITIONS larger than 
bucket number
 Key: SPARK-32767
 URL: https://issues.apache.org/jira/browse/SPARK-32767
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yuming Wang


How to reproduce this issue:
{code:scala}
spark.range(1000).write.bucketBy(500, "id").saveAsTable("t1")
spark.range(1000).write.bucketBy(50, "id").saveAsTable("t2")
sql("set spark.sql.shuffle.partitions=600")
sql("set spark.sql.autoBroadcastJoinThreshold=-1")
sql("select * from t1 join t2 on t1.id = t2.id").explain()
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32767) Bucket join should work if spark.sql.shuffle.partitions larger than bucket number

2020-09-01 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-32767:

Summary: Bucket join should work if spark.sql.shuffle.partitions larger 
than bucket number  (was: Bucket join should work if SHUFFLE_PARTITIONS larger 
than bucket number)

> Bucket join should work if spark.sql.shuffle.partitions larger than bucket 
> number
> -
>
> Key: SPARK-32767
> URL: https://issues.apache.org/jira/browse/SPARK-32767
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:scala}
> spark.range(1000).write.bucketBy(500, "id").saveAsTable("t1")
> spark.range(1000).write.bucketBy(50, "id").saveAsTable("t2")
> sql("set spark.sql.shuffle.partitions=600")
> sql("set spark.sql.autoBroadcastJoinThreshold=-1")
> sql("select * from t1 join t2 on t1.id = t2.id").explain()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32760) Support for INET data type

2020-09-01 Thread Ruslan Dautkhanov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188612#comment-17188612
 ] 

Ruslan Dautkhanov edited comment on SPARK-32760 at 9/1/20, 4:29 PM:


[~smilegator] understood. Would be great to consider separating logical and 
physical datatypes like it is done in Parquet for example. It might be easier 
to add higher-level / logical data types then? IPv4 address for example fits 
nicely into parquet's _INT64_ physical data type. Feel free to close if it's 
not feasible near-term. Thanks.


was (Author: tagar):
[~smilegator] understood. Would be great to consider separating logical and 
physical datatypes like it is done in Parquet for example. It might be easier 
to add higher-level data types then? IPv4 address for example fits nicely into 
parquet's _INT64_ data type. Feel free to close if it's not feasible near-term. 
Thanks.

> Support for INET data type
> --
>
> Key: SPARK-32760
> URL: https://issues.apache.org/jira/browse/SPARK-32760
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0, 3.1.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> PostgreSQL has support for the `INET` data type:
> [https://www.postgresql.org/docs/9.1/datatype-net-types.html]
> We have a few customers that are interested in similar, native support for IP
> addresses, just like in PostgreSQL.
> The issue with storing IP addresses as strings is that most matches (for
> example, checking whether an IP address belongs to a subnet) cannot take
> advantage of parquet bloom filters.
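
As a toy sketch of the idea in the comment above (packing an IPv4 address into
a 64-bit integer so that parquet INT64 statistics and bloom filters can be
used); `ipv4ToLong` is a made-up helper, not an existing Spark function:

{code:scala}
// Hypothetical sketch, assuming a spark-shell session.
import org.apache.spark.sql.functions.{col, udf}

// Pack the four octets of an IPv4 address into a single Long (fits in INT64).
val ipv4ToLong = udf((ip: String) =>
  ip.split('.').foldLeft(0L)((acc, octet) => (acc << 8) | octet.toLong))

val df = spark.createDataFrame(Seq(Tuple1("10.1.2.3"))).toDF("ip")
df.select(col("ip"), ipv4ToLong(col("ip")).as("ip_int64")).show()
{code}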



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32760) Support for INET data type

2020-09-01 Thread Ruslan Dautkhanov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188612#comment-17188612
 ] 

Ruslan Dautkhanov commented on SPARK-32760:
---

[~smilegator] understood. Would be great to consider separating logical and 
physical datatypes like it is done in Parquet for example. It might be easier 
to add higher-level data types then? IPv4 address for example fits nicely into 
parquet's _INT64_ data type. Feel free to close if it's not feasible near-term. 
Thanks.

> Support for INET data type
> --
>
> Key: SPARK-32760
> URL: https://issues.apache.org/jira/browse/SPARK-32760
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0, 3.1.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> PostgreSQL has support for the `INET` data type:
> [https://www.postgresql.org/docs/9.1/datatype-net-types.html]
> We have a few customers that are interested in similar, native support for IP
> addresses, just like in PostgreSQL.
> The issue with storing IP addresses as strings is that most matches (for
> example, checking whether an IP address belongs to a subnet) cannot take
> advantage of parquet bloom filters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28554) implement basic catalog functionalities

2020-09-01 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28554.
-
  Assignee: (was: Wenchen Fan)
Resolution: Duplicate

> implement basic catalog functionalities
> ---
>
> Key: SPARK-28554
> URL: https://issues.apache.org/jira/browse/SPARK-28554
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28554) implement basic catalog functionalities

2020-09-01 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188609#comment-17188609
 ] 

Wenchen Fan commented on SPARK-28554:
-

closed as it's done by other smaller PRs. See the sub-tasks of the parent 
ticket.

> implement basic catalog functionalities
> ---
>
> Key: SPARK-28554
> URL: https://issues.apache.org/jira/browse/SPARK-28554
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-28554) implement basic catalog functionalities

2020-09-01 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-28554:

Comment: was deleted

(was: closed as it's done by other smaller PRs. See the sub-tasks of the parent 
ticket.)

> implement basic catalog functionalities
> ---
>
> Key: SPARK-28554
> URL: https://issues.apache.org/jira/browse/SPARK-28554
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28554) implement basic catalog functionalities

2020-09-01 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188610#comment-17188610
 ] 

Wenchen Fan commented on SPARK-28554:
-

closed as it's done by other smaller PRs. See the sub-tasks of the parent 
ticket.

> implement basic catalog functionalities
> ---
>
> Key: SPARK-28554
> URL: https://issues.apache.org/jira/browse/SPARK-28554
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32754) Unify to `assertEqualJoinPlans ` for join reorder suites

2020-09-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32754:
-

Assignee: Zhenhua Wang  (was: Apache Spark)

> Unify to `assertEqualJoinPlans ` for join reorder suites
> 
>
> Key: SPARK-32754
> URL: https://issues.apache.org/jira/browse/SPARK-32754
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Minor
> Fix For: 3.1.0
>
>
> Now the three join reorder suites (`JoinReorderSuite`, `StarJoinReorderSuite`,
> `StarJoinCostBasedReorderSuite`) all contain an `assertEqualPlans` method, and
> the logic is almost the same. We can extract the method to a single place for
> code simplicity.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32754) Unify to `assertEqualJoinPlans ` for join reorder suites

2020-09-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32754:
--
Summary: Unify to `assertEqualJoinPlans ` for join reorder suites  (was: 
Unify `assertEqualPlans` for join reorder suites)

> Unify to `assertEqualJoinPlans ` for join reorder suites
> 
>
> Key: SPARK-32754
> URL: https://issues.apache.org/jira/browse/SPARK-32754
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Zhenhua Wang
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.1.0
>
>
> Now the three join reorder suites (`JoinReorderSuite`, `StarJoinReorderSuite`,
> `StarJoinCostBasedReorderSuite`) all contain an `assertEqualPlans` method, and
> the logic is almost the same. We can extract the method to a single place for
> code simplicity.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32754) Unify `assertEqualPlans` for join reorder suites

2020-09-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32754.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29594
[https://github.com/apache/spark/pull/29594]

> Unify `assertEqualPlans` for join reorder suites
> 
>
> Key: SPARK-32754
> URL: https://issues.apache.org/jira/browse/SPARK-32754
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Zhenhua Wang
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.1.0
>
>
> Now the three join reorder suites (`JoinReorderSuite`, `StarJoinReorderSuite`,
> `StarJoinCostBasedReorderSuite`) all contain an `assertEqualPlans` method, and
> the logic is almost the same. We can extract the method to a single place for
> code simplicity.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32766) s3a: bucket names with dots cannot be used

2020-09-01 Thread Ondrej Kokes (Jira)
Ondrej Kokes created SPARK-32766:


 Summary: s3a: bucket names with dots cannot be used
 Key: SPARK-32766
 URL: https://issues.apache.org/jira/browse/SPARK-32766
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 3.0.0
Reporter: Ondrej Kokes


Running vanilla spark with
{noformat}
--packages=org.apache.hadoop:hadoop-aws:x.y.z{noformat}
I cannot read from S3 if the bucket name contains a dot (which is a valid bucket name).

A minimal reproducible example looks like this:
{code:python}
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

if __name__ == '__main__':
  spark = (SparkSession
    .builder
    .appName('my_app')
    .master("local[*]")
    .getOrCreate()
  )

  spark.read.csv("s3a://test-bucket-name-v1.0/foo.csv")
{code}

Or just launch a spark-shell with `--packages=(...)hadoop-aws(...)` and read 
that CSV. I created the same bucket without the period and it worked fine.

*Now I'm not sure whether this is a matter of how the path names are prepared
and passed to the aws-sdk, or whether the fault is within the SDK itself. I am
not Java-savvy enough to investigate the issue further, but I tried to make the
repro as short as possible.*



I get different errors depending on which Hadoop distribution I use. If I use
the default PySpark distribution (which includes Hadoop 2), I get the following
(using hadoop-aws:2.7.4):

{noformat}
scala> spark.read.csv("s3a://okokes-test-v2.5/foo.csv").show()
java.lang.IllegalArgumentException: The bucketName parameter must be specified.
 at com.amazonaws.services.s3.AmazonS3Client.assertParameterNotNull(AmazonS3Client.java:2816)
 at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1026)
 at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
 at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
 at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
 at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
 at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
 at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
 at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:535)
 ... 47 elided
{noformat}

When I downloaded 3.0.0 with Hadoop 3 and ran a spark-shell there, I got this 
error (with hadoop-aws:3.2.0):

{noformat}
java.lang.NullPointerException: null uri host.
 at java.base/java.util.Objects.requireNonNull(Objects.java:246)
 at org.apache.hadoop.fs.s3native.S3xLoginHelper.buildFSURI(S3xLoginHelper.java:71)
 at org.apache.hadoop.fs.s3a.S3AFileSystem.setUri(S3AFileSystem.java:470)
 at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:235)
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
 at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
 at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
 at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
 at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
 at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:535)
 ... 47 elided
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32761) Planner error when aggregating multiple distinct Constant columns

2020-09-01 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32761.
-
Fix Version/s: 3.1.0
 Assignee: Liu, Linhong
   Resolution: Fixed

> Planner error when aggregating multiple distinct Constant columns
> -
>
> Key: SPARK-32761
> URL: https://issues.apache.org/jira/browse/SPARK-32761
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Linhong Liu
>Assignee: Liu, Linhong
>Priority: Major
> Fix For: 3.1.0
>
>
> SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 2, 3) will trigger this bug.
> The problematic code is:
>  
> {code:java}
> val distinctAggGroups = aggExpressions.filter(_.isDistinct).groupBy { e =>
>   val unfoldableChildren = 
> e.aggregateFunction.children.filter(!_.foldable).toSet
>   if (unfoldableChildren.nonEmpty) {
> // Only expand the unfoldable children
>  unfoldableChildren
>   } else {
> // If aggregateFunction's children are all foldable
> // we must expand at least one of the children (here we take the first 
> child),
> // or If we don't, we will get the wrong result, for example:
> // count(distinct 1) will be explained to count(1) after the rewrite 
> function.
> // Generally, the distinct aggregateFunction should not run
> // foldable TypeCheck for the first child.
> e.aggregateFunction.children.take(1).toSet
>   }
> }
> {code}
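
For reference, the triggering query quoted from the description above, runnable
as-is in spark-shell:

{code:scala}
// Aggregating multiple distinct constant columns triggers the planner error.
spark.sql("SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 2, 3)").show()
{code}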



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32763) DiskStore can remove block from blockSizes even if block is not removed from disk

2020-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188405#comment-17188405
 ] 

Apache Spark commented on SPARK-32763:
--

User 'q2w' has created a pull request for this issue:
https://github.com/apache/spark/pull/29611

> DiskStore can remove block from blockSizes even if block is not removed from 
> disk
> -
>
> Key: SPARK-32763
> URL: https://issues.apache.org/jira/browse/SPARK-32763
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: abhishek kumar tiwari
>Priority: Minor
>
> DiskStore can remove a block from blockSizes even though the block could not 
> be removed from disk.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32108) Silent mode of spark-sql is broken

2020-09-01 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk resolved SPARK-32108.

Resolution: Not A Problem

> Silent mode of spark-sql is broken
> --
>
> Key: SPARK-32108
> URL: https://issues.apache.org/jira/browse/SPARK-32108
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> 1. I downloaded the recent release Spark 3.0 from
> http://spark.apache.org/downloads.html
> 2. Ran bin/spark-sql -S; it prints a lot of INFO:
> {code}
> ➜  ~ ./spark-3.0/bin/spark-sql -S
> 20/06/26 20:43:38 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> log4j:WARN No appenders could be found for logger 
> (org.apache.hadoop.hive.conf.HiveConf).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
> info.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 20/06/26 20:43:39 INFO SharedState: spark.sql.warehouse.dir is not set, but 
> hive.metastore.warehouse.dir is set. Setting spark.sql.warehouse.dir to the 
> value of hive.metastore.warehouse.dir ('/user/hive/warehouse').
> 20/06/26 20:43:39 INFO SharedState: Warehouse path is '/user/hive/warehouse'.
> 20/06/26 20:43:39 INFO SessionState: Created HDFS directory: 
> /tmp/hive/maximgekk/a47e882c-86a3-42b9-b43f-9dab0dd8492a
> 20/06/26 20:43:39 INFO SessionState: Created local directory: 
> /var/folders/p3/dfs6mf655d7fnjrsjvldh0tcgn/T/maximgekk/a47e882c-86a3-42b9-b43f-9dab0dd8492a
> 20/06/26 20:43:39 INFO SessionState: Created HDFS directory: 
> /tmp/hive/maximgekk/a47e882c-86a3-42b9-b43f-9dab0dd8492a/_tmp_space.db
> 20/06/26 20:43:39 INFO SparkContext: Running Spark version 3.0.0
> 20/06/26 20:43:39 INFO ResourceUtils: 
> ==
> 20/06/26 20:43:39 INFO ResourceUtils: Resources for spark.driver:
> 20/06/26 20:43:39 INFO ResourceUtils: 
> ==
> 20/06/26 20:43:39 INFO SparkContext: Submitted application: 
> SparkSQL::192.168.1.78
> 20/06/26 20:43:39 INFO SecurityManager: Changing view acls to: maximgekk
> 20/06/26 20:43:39 INFO SecurityManager: Changing modify acls to: maximgekk
> 20/06/26 20:43:39 INFO SecurityManager: Changing view acls groups to:
> 20/06/26 20:43:39 INFO SecurityManager: Changing modify acls groups to:
> 20/06/26 20:43:39 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users  with view permissions: Set(maximgekk); 
> groups with view permissions: Set(); users  with modify permissions: 
> Set(maximgekk); groups with modify permissions: Set()
> 20/06/26 20:43:39 INFO Utils: Successfully started service 'sparkDriver' on 
> port 59414.
> 20/06/26 20:43:39 INFO SparkEnv: Registering MapOutputTracker
> 20/06/26 20:43:39 INFO SparkEnv: Registering BlockManagerMaster
> 20/06/26 20:43:39 INFO BlockManagerMasterEndpoint: Using 
> org.apache.spark.storage.DefaultTopologyMapper for getting topology 
> information
> 20/06/26 20:43:39 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint 
> up
> 20/06/26 20:43:39 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
> 20/06/26 20:43:39 INFO DiskBlockManager: Created local directory at 
> /private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tcgn/T/blockmgr-c1d041ad-dd46-4d11-bbd0-e8ba27d3bf69
> 20/06/26 20:43:39 INFO MemoryStore: MemoryStore started with capacity 408.9 
> MiB
> 20/06/26 20:43:39 INFO SparkEnv: Registering OutputCommitCoordinator
> 20/06/26 20:43:40 INFO Utils: Successfully started service 'SparkUI' on port 
> 4040.
> 20/06/26 20:43:40 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at 
> http://192.168.1.78:4040
> 20/06/26 20:43:40 INFO Executor: Starting executor ID driver on host 
> 192.168.1.78
> 20/06/26 20:43:40 INFO Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 59415.
> 20/06/26 20:43:40 INFO NettyBlockTransferService: Server created on 
> 192.168.1.78:59415
> 20/06/26 20:43:40 INFO BlockManager: Using 
> org.apache.spark.storage.RandomBlockReplicationPolicy for block replication 
> policy
> 20/06/26 20:43:40 INFO BlockManagerMaster: Registering BlockManager 
> BlockManagerId(driver, 192.168.1.78, 59415, None)
> 20/06/26 20:43:40 INFO BlockManagerMasterEndpoint: Registering block manager 
> 192.168.1.78:59415 with 408.9 MiB RAM, BlockManagerId(driver, 192.168.1.78, 
> 59415, None)
> 20/06/26 20:43:40 INFO BlockManagerMaster: Registered BlockManager 
> BlockManagerId(driver, 192.168.1.78, 59415, None)
> 20/06/26 20:43:40 INFO BlockManager: Initialized BlockManager: 
> 

[jira] [Created] (SPARK-32765) EliminateJoinToEmptyRelation should respect exchange behavior when canChangeNumPartitions == false

2020-09-01 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-32765:
---

 Summary: EliminateJoinToEmptyRelation should respect exchange 
behavior when canChangeNumPartitions == false
 Key: SPARK-32765
 URL: https://issues.apache.org/jira/browse/SPARK-32765
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
 Environment: Currently, the EliminateJoinToEmptyRelation rule converts a Join
into an EmptyRelation in some cases when AQE is on. But if either sub-plan of
the Join contains a ShuffleQueryStage(canChangeNumPartitions == false), which
means the Exchange was produced by a repartition or by singlePartition,
converting the Join into an EmptyRelation would lose the user-specified number
of partitions for downstream operators, which is not correct.

So this patch skips the conversion if either sub-plan of the Join contains a
ShuffleQueryStage(canChangeNumPartitions == false)
Reporter: Leanken.Lin






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32637) SPARK SQL JDBC truncates last value of seconds for datetime2 values for Azure SQL DB

2020-09-01 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188384#comment-17188384
 ] 

Maxim Gekk commented on SPARK-32637:


Spark's TIMESTAMP type has microsecond precision. This is by design and it is 
not a bug.
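
A small illustration of the point above (not from the ticket): Spark's
TIMESTAMP keeps at most six fractional digits, so the seventh digit of a
datetime2 value such as '2007-08-08 12:35:29.1234567' cannot be carried
through Spark at all.

{code:scala}
// Assumes a spark-shell session; microseconds are the finest representable unit.
spark.sql("SELECT CAST('2007-08-08 12:35:29.123456' AS TIMESTAMP) AS ts").show(false)
{code}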

> SPARK SQL JDBC truncates last value of seconds for datetime2 values for Azure 
> SQL DB 
> -
>
> Key: SPARK-32637
> URL: https://issues.apache.org/jira/browse/SPARK-32637
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.1
>Reporter: Mohit Dave
>Priority: Major
>
> Spark JDBC truncates the seventh fractional-second digit of TIMESTAMP values
> when the datetime2 datatype is used with the Microsoft SQL Server JDBC driver.
>  
> Source data (datetime2): '2007-08-08 12:35:29.1234567'
>  
> After loading to the target using Spark DataFrames:
>  
> Target data (datetime2): '2007-08-08 12:35:29.1234560'
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


