[jira] [Assigned] (SPARK-37202) Temp view didn't collect temp function that registered with catalog API

2021-11-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37202:


Assignee: (was: Apache Spark)

> Temp view didn't collect temp function that registered with catalog API
> ---
>
> Key: SPARK-37202
> URL: https://issues.apache.org/jira/browse/SPARK-37202
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Linhong Liu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37202) Temp view didn't collect temp function that registered with catalog API

2021-11-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37202:


Assignee: Apache Spark

> Temp view didn't collect temp function that registered with catalog API
> ---
>
> Key: SPARK-37202
> URL: https://issues.apache.org/jira/browse/SPARK-37202
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Linhong Liu
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37202) Temp view didn't collect temp function that registered with catalog API

2021-11-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437730#comment-17437730
 ] 

Apache Spark commented on SPARK-37202:
--

User 'linhongliu-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/34473

> Temp view didn't collect temp function that registered with catalog API
> ---
>
> Key: SPARK-37202
> URL: https://issues.apache.org/jira/browse/SPARK-37202
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Linhong Liu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37202) Temp view didn't collect temp function that registered with catalog API

2021-11-02 Thread Linhong Liu (Jira)
Linhong Liu created SPARK-37202:
---

 Summary: Temp view didn't collect temp function that registered 
with catalog API
 Key: SPARK-37202
 URL: https://issues.apache.org/jira/browse/SPARK-37202
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Linhong Liu
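
The issue has no description, so the following is only a hedged reconstruction of the kind of scenario the summary describes: a temporary function registered through the catalog/UDF API and then referenced from a temporary view (all API calls below are real Spark APIs; the scenario itself is an assumption).

{code:java}
// Hedged sketch of the scenario implied by the summary (assumed, since the
// issue has no description): the temp view presumably does not record its
// dependency on the temp function registered via the catalog/UDF API.
spark.udf.register("plus_one", (i: Int) => i + 1)
spark.sql("CREATE OR REPLACE TEMPORARY VIEW v AS SELECT plus_one(id) AS x FROM range(3)")
spark.table("v").show()
{code}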






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24774) support reading AVRO logical types - Decimal

2021-11-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437714#comment-17437714
 ] 

Apache Spark commented on SPARK-24774:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/34472

> support reading AVRO logical types - Decimal
> 
>
> Key: SPARK-24774
> URL: https://issues.apache.org/jira/browse/SPARK-24774
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>
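
For context, a hedged sketch (not from the issue; the file path and column name are hypothetical) of reading an Avro column that uses the decimal logical type, which surfaces in Spark as a Catalyst DecimalType when the spark-avro module is on the classpath:

{code:java}
// Hedged sketch: path and column name are hypothetical.
val df = spark.read.format("avro").load("/tmp/decimals.avro")
df.printSchema()                    // e.g. amount: decimal(17,2)
df.selectExpr("sum(amount)").show() // decimal arithmetic preserves scale
{code}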




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24774) support reading AVRO logical types - Decimal

2021-11-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437713#comment-17437713
 ] 

Apache Spark commented on SPARK-24774:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/34472

> support reading AVRO logical types - Decimal
> 
>
> Key: SPARK-24774
> URL: https://issues.apache.org/jira/browse/SPARK-24774
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37191) Allow merging DecimalTypes with different precision values

2021-11-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37191.
-
Resolution: Fixed

Issue resolved by pull request 34462
[https://github.com/apache/spark/pull/34462]

> Allow merging DecimalTypes with different precision values 
> ---
>
> Key: SPARK-37191
> URL: https://issues.apache.org/jira/browse/SPARK-37191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.0, 3.1.1, 3.2.0
>Reporter: Ivan
>Assignee: Ivan
>Priority: Major
> Fix For: 3.3.0
>
>
> When merging DecimalTypes with different precision but the same scale, one 
> would get the following error:
> {code:java}
> Failed to merge fields 'col' and 'col'. Failed to merge decimal types with 
> incompatible precision 17 and 12   at 
> org.apache.spark.sql.types.StructType$.$anonfun$merge$2(StructType.scala:652)
>   at scala.Option.map(Option.scala:230)
>   at 
> org.apache.spark.sql.types.StructType$.$anonfun$merge$1(StructType.scala:644)
>   at 
> org.apache.spark.sql.types.StructType$.$anonfun$merge$1$adapted(StructType.scala:641)
>   at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>   at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>   at org.apache.spark.sql.types.StructType$.merge(StructType.scala:641)
>   at org.apache.spark.sql.types.StructType.merge(StructType.scala:550) 
> {code}
>  
> We could allow merging DecimalType values with different precision when the 
> scale is the same for both types: there should be no data correctness issues, 
> since one of the types is simply widened, for example DECIMAL(12, 2) -> 
> DECIMAL(17, 2). This is not the case when the scales differ, because whether 
> the upcast is safe would then depend on the actual values.
>  
> Repro code:
> {code:java}
> import org.apache.spark.sql.types._
> val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil)
> val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil)
> schema1.merge(schema2) {code}
>  
> This also affects Parquet schema merge which is where this issue was 
> discovered originally:
> {code:java}
> import java.math.BigDecimal
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val data1 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1)
> val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil)
> val data2 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1)
> val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil)
> spark.createDataFrame(data2, 
> schema2).write.parquet("/tmp/decimal-test.parquet")
> spark.createDataFrame(data1, 
> schema1).write.mode("append").parquet("/tmp/decimal-test.parquet")
> // Reading the DataFrame fails
> spark.read.option("mergeSchema", 
> "true").parquet("/tmp/decimal-test.parquet").show()
> >>>
> Failed merging schema:
> root
>  |-- col: decimal(17,2) (nullable = true)
> Caused by: Failed to merge fields 'col' and 'col'. Failed to merge decimal 
> types with incompatible precision 12 and 17
> {code}
>  
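
A minimal sketch of the rule described above, assuming the proposal is simply to widen to the larger precision when the scales match; this is illustrative only, not Spark's actual StructType.merge code:

{code:java}
import org.apache.spark.sql.types._

// Hedged sketch: decimals with equal scale merge to the wider precision.
def mergeSameScaleDecimals(a: DecimalType, b: DecimalType): DecimalType = {
  require(a.scale == b.scale, "only same-scale decimals can be widened safely")
  DecimalType(math.max(a.precision, b.precision), a.scale)
}

// DECIMAL(12,2) and DECIMAL(17,2) would merge to DECIMAL(17,2).
assert(mergeSameScaleDecimals(DecimalType(12, 2), DecimalType(17, 2)) == DecimalType(17, 2))
{code}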



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37191) Allow merging DecimalTypes with different precision values

2021-11-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37191:
---

Assignee: Ivan

> Allow merging DecimalTypes with different precision values 
> ---
>
> Key: SPARK-37191
> URL: https://issues.apache.org/jira/browse/SPARK-37191
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.0, 3.1.1, 3.2.0
>Reporter: Ivan
>Assignee: Ivan
>Priority: Major
> Fix For: 3.3.0
>
>
> When merging DecimalTypes with different precision but the same scale, one 
> would get the following error:
> {code:java}
> Failed to merge fields 'col' and 'col'. Failed to merge decimal types with 
> incompatible precision 17 and 12   at 
> org.apache.spark.sql.types.StructType$.$anonfun$merge$2(StructType.scala:652)
>   at scala.Option.map(Option.scala:230)
>   at 
> org.apache.spark.sql.types.StructType$.$anonfun$merge$1(StructType.scala:644)
>   at 
> org.apache.spark.sql.types.StructType$.$anonfun$merge$1$adapted(StructType.scala:641)
>   at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>   at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>   at org.apache.spark.sql.types.StructType$.merge(StructType.scala:641)
>   at org.apache.spark.sql.types.StructType.merge(StructType.scala:550) 
> {code}
>  
> We could allow merging DecimalType values with different precision when the 
> scale is the same for both types: there should be no data correctness issues, 
> since one of the types is simply widened, for example DECIMAL(12, 2) -> 
> DECIMAL(17, 2). This is not the case when the scales differ, because whether 
> the upcast is safe would then depend on the actual values.
>  
> Repro code:
> {code:java}
> import org.apache.spark.sql.types._
> val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil)
> val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil)
> schema1.merge(schema2) {code}
>  
> This also affects Parquet schema merge which is where this issue was 
> discovered originally:
> {code:java}
> import java.math.BigDecimal
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val data1 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1)
> val schema1 = StructType(StructField("col", DecimalType(17, 2)) :: Nil)
> val data2 = sc.parallelize(Row(new BigDecimal("123456789.11")) :: Nil, 1)
> val schema2 = StructType(StructField("col", DecimalType(12, 2)) :: Nil)
> spark.createDataFrame(data2, 
> schema2).write.parquet("/tmp/decimal-test.parquet")
> spark.createDataFrame(data1, 
> schema1).write.mode("append").parquet("/tmp/decimal-test.parquet")
> // Reading the DataFrame fails
> spark.read.option("mergeSchema", 
> "true").parquet("/tmp/decimal-test.parquet").show()
> >>>
> Failed merging schema:
> root
>  |-- col: decimal(17,2) (nullable = true)
> Caused by: Failed to merge fields 'col' and 'col'. Failed to merge decimal 
> types with incompatible precision 12 and 17
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32567) Code-gen for full outer shuffled hash join

2021-11-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32567.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 3
[https://github.com/apache/spark/pull/3]

> Code-gen for full outer shuffled hash join
> --
>
> Key: SPARK-32567
> URL: https://issues.apache.org/jira/browse/SPARK-32567
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Fix For: 3.3.0
>
>
> As a followup for [https://github.com/apache/spark/pull/29342] (non-codegen 
> full outer shuffled hash join), this task is to add code-gen for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32567) Code-gen for full outer shuffled hash join

2021-11-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32567:
---

Assignee: Cheng Su

> Code-gen for full outer shuffled hash join
> --
>
> Key: SPARK-32567
> URL: https://issues.apache.org/jira/browse/SPARK-32567
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
>
> As a followup for [https://github.com/apache/spark/pull/29342] (non-codegen 
> full outer shuffled hash join), this task is to add code-gen for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37201) Spark SQL reads unnecessary nested fields (filter after explode)

2021-11-02 Thread Sergey Kotlov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Kotlov updated SPARK-37201:
--
Summary: Spark SQL reads unnecessary nested fields (filter after explode)  
(was: Spark SQL reads unnecсessary nested fields (filter after explode))

> Spark SQL reads unnecessary nested fields (filter after explode)
> 
>
> Key: SPARK-37201
> URL: https://issues.apache.org/jira/browse/SPARK-37201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Sergey Kotlov
>Priority: Major
>
> In this example, reading unnecessary nested fields still happens.
> Data preparation:
>  
> {code:java}
> case class Struct(v1: String, v2: String, v3: String)
> case class Event(struct: Struct, array: Seq[String])
> Seq(
>   Event(Struct("v1","v2","v3"), Seq("cx1", "cx2"))
> ).toDF().write.mode("overwrite").saveAsTable("table")
> {code}
>  
> The v2 and v3 columns aren't needed here, but they still show up in the 
> physical plan's read schema.
> {code:java}
> spark.table("table")
>   .select($"struct.v1", explode($"array").as("el"))
>   .filter($"el" === "cx1")
>   .explain(true)
>  
> == Physical Plan ==
> ... ReadSchema: 
> struct<struct:struct<v1:string,v2:string,v3:string>,array:array<string>>
> {code}
> If you just remove the _filter_ or move the _explode_ to a second _select_, 
> everything is fine:
> {code:java}
> spark.table("table")
>   .select($"struct.v1", explode($"array").as("el"))
>   //.filter($"el" === "cx1")
>   .explain(true)
>   
> // ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>
> spark.table("table")
>   .select($"struct.v1", $"array")
>   .select($"v1", explode($"array").as("el"))
>   .filter($"el" === "cx1")
>   .explain(true)
>   
> // ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37201) Spark SQL reads unnecсessary nested fields (filter after explode)

2021-11-02 Thread Sergey Kotlov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Kotlov updated SPARK-37201:
--
Summary: Spark SQL reads unnecсessary nested fields (filter after explode)  
(was: Spark SQL reads unnecessary nested fields (filter after explode))

> Spark SQL reads unnecсessary nested fields (filter after explode)
> -
>
> Key: SPARK-37201
> URL: https://issues.apache.org/jira/browse/SPARK-37201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Sergey Kotlov
>Priority: Major
>
> In this example, reading unnecessary nested fields still happens.
> Data preparation:
>  
> {code:java}
> case class Struct(v1: String, v2: String, v3: String)
> case class Event(struct: Struct, array: Seq[String])
> Seq(
>   Event(Struct("v1","v2","v3"), Seq("cx1", "cx2"))
> ).toDF().write.mode("overwrite").saveAsTable("table")
> {code}
>  
> The v2 and v3 columns aren't needed here, but they still show up in the 
> physical plan's read schema.
> {code:java}
> spark.table("table")
>   .select($"struct.v1", explode($"array").as("el"))
>   .filter($"el" === "cx1")
>   .explain(true)
>  
> == Physical Plan ==
> ... ReadSchema: 
> struct<struct:struct<v1:string,v2:string,v3:string>,array:array<string>>
> {code}
> If you just remove the _filter_ or move the _explode_ to a second _select_, 
> everything is fine:
> {code:java}
> spark.table("table")
>   .select($"struct.v1", explode($"array").as("el"))
>   //.filter($"el" === "cx1")
>   .explain(true)
>   
> // ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>
> spark.table("table")
>   .select($"struct.v1", $"array")
>   .select($"v1", explode($"array").as("el"))
>   .filter($"el" === "cx1")
>   .explain(true)
>   
> // ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37201) Spark SQL reads unnecessary nested fields (filter after explode)

2021-11-02 Thread Sergey Kotlov (Jira)
Sergey Kotlov created SPARK-37201:
-

 Summary: Spark SQL reads unnecessary nested fields (filter after 
explode)
 Key: SPARK-37201
 URL: https://issues.apache.org/jira/browse/SPARK-37201
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Sergey Kotlov


In this example, reading unnecessary nested fields still happens.

Data preparation:

 
{code:java}
case class Struct(v1: String, v2: String, v3: String)
case class Event(struct: Struct, array: Seq[String])

Seq(
  Event(Struct("v1","v2","v3"), Seq("cx1", "cx2"))
).toDF().write.mode("overwrite").saveAsTable("table")
{code}
 

The v2 and v3 columns aren't needed here, but they still show up in the physical 
plan's read schema.
{code:java}
spark.table("table")
  .select($"struct.v1", explode($"array").as("el"))
  .filter($"el" === "cx1")
  .explain(true)
 
== Physical Plan ==
... ReadSchema: 
struct<struct:struct<v1:string,v2:string,v3:string>,array:array<string>>

{code}
If you just remove the _filter_ or move the _explode_ to a second _select_, 
everything is fine:
{code:java}
spark.table("table")
  .select($"struct.v1", explode($"array").as("el"))
  //.filter($"el" === "cx1")
  .explain(true)
  
// ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>

spark.table("table")
  .select($"struct.v1", $"array")
  .select($"v1", explode($"array").as("el"))
  .filter($"el" === "cx1")
  .explain(true)
  
// ... ReadSchema: struct<struct:struct<v1:string>,array:array<string>>
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36935) Enhance ParquetSchemaConverter to capture Parquet repetition & definition level

2021-11-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-36935.
---
Fix Version/s: 3.3.0
 Assignee: Chao Sun
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/34199

> Enhance ParquetSchemaConverter to capture Parquet repetition & definition 
> level
> ---
>
> Key: SPARK-36935
> URL: https://issues.apache.org/jira/browse/SPARK-36935
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.3.0
>
>
> In order to support complex types in the Parquet vectorized reader, we'll need to 
> capture the repetition & definition level information associated with the 
> Catalyst type converted from the Parquet {{MessageType}}.
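
For readers unfamiliar with the Dremel-style levels being referred to, a hedged illustration follows; the schema is a made-up example (not from the issue), and the levels are queried through parquet-mr's own schema API:

{code:java}
import org.apache.parquet.schema.MessageTypeParser

// Hedged illustration: made-up schema, levels computed by parquet-mr itself.
val schema = MessageTypeParser.parseMessageType(
  """message example {
    |  optional group a {
    |    repeated int32 b;
    |  }
    |}""".stripMargin)

// optional a contributes 1 definition level; repeated b contributes 1 of each.
assert(schema.getMaxDefinitionLevel("a", "b") == 2)
assert(schema.getMaxRepetitionLevel("a", "b") == 1)
{code}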



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36879) Support Parquet v2 data page encodings for the vectorized path

2021-11-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36879:


Assignee: (was: Apache Spark)

> Support Parquet v2 data page encodings for the vectorized path
> --
>
> Key: SPARK-36879
> URL: https://issues.apache.org/jira/browse/SPARK-36879
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently Spark only supports Parquet v1 encodings (i.e., 
> PLAIN/DICTIONARY/RLE) in the vectorized path, and throws an exception otherwise:
> {code}
> java.lang.UnsupportedOperationException: Unsupported encoding: 
> DELTA_BYTE_ARRAY
> {code}
> It will be good to support v2 encodings too, including DELTA_BINARY_PACKED, 
> DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY as well as BYTE_STREAM_SPLIT as 
> listed in https://github.com/apache/parquet-format/blob/master/Encodings.md
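
A hedged sketch (not from the issue) of one way this limitation can be hit today: the writer-side settings below are parquet-mr options (parquet.writer.version, parquet.enable.dictionary), and the assumption that plain string pages then fall back to DELTA_BYTE_ARRAY is about the v2 writer's defaults, not something the issue states.

{code:java}
// Hedged sketch: write with the Parquet v2 writer, dictionary disabled, then
// read back with the default vectorized reader.
val hc = spark.sparkContext.hadoopConfiguration
hc.set("parquet.writer.version", "PARQUET_2_0")
hc.set("parquet.enable.dictionary", "false")

spark.range(1000).selectExpr("cast(id AS string) AS s")
  .write.mode("overwrite").parquet("/tmp/parquet-v2-strings")

// Expected today, per the description above:
// java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY
spark.read.parquet("/tmp/parquet-v2-strings").show()
{code}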



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36879) Support Parquet v2 data page encodings for the vectorized path

2021-11-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437628#comment-17437628
 ] 

Apache Spark commented on SPARK-36879:
--

User 'parthchandra' has created a pull request for this issue:
https://github.com/apache/spark/pull/34471

> Support Parquet v2 data page encodings for the vectorized path
> --
>
> Key: SPARK-36879
> URL: https://issues.apache.org/jira/browse/SPARK-36879
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently Spark only supports Parquet v1 encodings (i.e., 
> PLAIN/DICTIONARY/RLE) in the vectorized path, and throws an exception otherwise:
> {code}
> java.lang.UnsupportedOperationException: Unsupported encoding: 
> DELTA_BYTE_ARRAY
> {code}
> It will be good to support v2 encodings too, including DELTA_BINARY_PACKED, 
> DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY as well as BYTE_STREAM_SPLIT as 
> listed in https://github.com/apache/parquet-format/blob/master/Encodings.md



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36879) Support Parquet v2 data page encodings for the vectorized path

2021-11-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36879:


Assignee: Apache Spark

> Support Parquet v2 data page encodings for the vectorized path
> --
>
> Key: SPARK-36879
> URL: https://issues.apache.org/jira/browse/SPARK-36879
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Major
>
> Currently Spark only supports Parquet v1 encodings (i.e., 
> PLAIN/DICTIONARY/RLE) in the vectorized path, and throws an exception otherwise:
> {code}
> java.lang.UnsupportedOperationException: Unsupported encoding: 
> DELTA_BYTE_ARRAY
> {code}
> It will be good to support v2 encodings too, including DELTA_BINARY_PACKED, 
> DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY as well as BYTE_STREAM_SPLIT as 
> listed in https://github.com/apache/parquet-format/blob/master/Encodings.md



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37200) Drop index support

2021-11-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37200:


Assignee: (was: Apache Spark)

> Drop index support
> --
>
> Key: SPARK-37200
> URL: https://issues.apache.org/jira/browse/SPARK-37200
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37200) Drop index support

2021-11-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437597#comment-17437597
 ] 

Apache Spark commented on SPARK-37200:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34469

> Drop index support
> --
>
> Key: SPARK-37200
> URL: https://issues.apache.org/jira/browse/SPARK-37200
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37200) Drop index support

2021-11-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37200:


Assignee: Apache Spark

> Drop index support
> --
>
> Key: SPARK-37200
> URL: https://issues.apache.org/jira/browse/SPARK-37200
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37200) Drop index support

2021-11-02 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-37200:
--

 Summary: Drop index support
 Key: SPARK-37200
 URL: https://issues.apache.org/jira/browse/SPARK-37200
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Huaxin Gao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37185) DataFrame.take() only uses one worker

2021-11-02 Thread mathieu longtin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437585#comment-17437585
 ] 

mathieu longtin commented on SPARK-37185:
-

It seems to try to optimize for a simple query, but not for more complex queries. 
That kind of makes sense for "select * from t", but any WHERE clause can make it 
quite restrictive.

It looks like it scans the first part, doesn't find enough data, then scans 
four parts, then decides to scan everything. This is nice, but meanwhile I 
already have 20 workers reserved; it wouldn't cost anything more to just scan 
everything right away.

Timing (table is not cached; it consists of 69 csv.gz files ranging from 1 MB to 
2.2 GB):
{code:java}
In [1]: %time spark.sql("select * from t where x = 99").take(10)
CPU times: user 83.9 ms, sys: 112 ms, total: 196 ms
Wall time: 6min 44s
...
In [2]: %time spark.sql("select * from t where x = 99").limit(10).rdd.collect()
CPU times: user 45.7 ms, sys: 73.9 ms, total: 120 ms
Wall time: 3min 59s
...


{code}
I ran the two tests a few times to make sure there was no OS-level caching 
effect; the timing didn't change much.

If I cache the table first, then "take(10)" is faster than 
"limit(10).rdd.collect()".

> DataFrame.take() only uses one worker
> -
>
> Key: SPARK-37185
> URL: https://issues.apache.org/jira/browse/SPARK-37185
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.0
> Environment: CentOS 7
>Reporter: mathieu longtin
>Priority: Major
>
> Say you have query:
> {code:java}
> >>> df = spark.sql("select * from mytable where x = 99"){code}
> Now, out of billions of rows, there are only ten rows where x is 99.
> If I do:
> {code:java}
> >>> df.limit(10).collect()
> [Stage 1:>  (0 + 1) / 1]{code}
> It only uses one worker. This takes a really long time since one CPU is 
> reading billions of rows.
> However, if I do this:
> {code:java}
> >>> df.limit(10).rdd.collect()
> [Stage 1:>  (0 + 10) / 22]{code}
> All the workers are running.
> I think there's an optimization issue in DataFrame.take(...).
> This did not use to be an issue, but I'm not sure whether it was working in 3.0 
> or 2.4.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37199) Add a deterministic field to QueryPlan

2021-11-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437527#comment-17437527
 ] 

Apache Spark commented on SPARK-37199:
--

User 'somani' has created a pull request for this issue:
https://github.com/apache/spark/pull/34470

> Add a deterministic field to QueryPlan
> --
>
> Key: SPARK-37199
> URL: https://issues.apache.org/jira/browse/SPARK-37199
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Abhishek Somani
>Priority: Major
>
> We have a _deterministic_ field in 
> [Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115]
>  to check if an expression is deterministic, but we do not have a similar 
> field in 
> [QueryPlan.|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44]
> We have a need for such a check in the QueryPlan sometimes, like in 
> [InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56]
> This proposal is to add a _deterministic_ field to QueryPlan.
> More details [in this 
> document|https://docs.google.com/document/d/1eIiaSJf-Co2HhjsaQxFNGwUxobnHID4ZGmJMcVytREc/edit#].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37199) Add a deterministic field to QueryPlan

2021-11-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37199:


Assignee: Apache Spark

> Add a deterministic field to QueryPlan
> --
>
> Key: SPARK-37199
> URL: https://issues.apache.org/jira/browse/SPARK-37199
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Abhishek Somani
>Assignee: Apache Spark
>Priority: Major
>
> We have a _deterministic_ field in 
> [Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115]
>  to check if an expression is deterministic, but we do not have a similar 
> field in 
> [QueryPlan.|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44]
> We have a need for such a check in the QueryPlan sometimes, like in 
> [InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56]
> This proposal is to add a _deterministic_ field to QueryPlan.
> More details [in this 
> document|https://docs.google.com/document/d/1eIiaSJf-Co2HhjsaQxFNGwUxobnHID4ZGmJMcVytREc/edit#].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37199) Add a deterministic field to QueryPlan

2021-11-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37199:


Assignee: (was: Apache Spark)

> Add a deterministic field to QueryPlan
> --
>
> Key: SPARK-37199
> URL: https://issues.apache.org/jira/browse/SPARK-37199
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Abhishek Somani
>Priority: Major
>
> We have a _deterministic_ field in 
> [Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115]
>  to check if an expression is deterministic, but we do not have a similar 
> field in 
> [QueryPlan.|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44]
> We have a need for such a check in the QueryPlan sometimes, like in 
> [InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56]
> This proposal is to add a _deterministic_ field to QueryPlan.
> More details [in this 
> document|https://docs.google.com/document/d/1eIiaSJf-Co2HhjsaQxFNGwUxobnHID4ZGmJMcVytREc/edit#].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37199) Add a deterministic field to QueryPlan

2021-11-02 Thread Abhishek Somani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Somani updated SPARK-37199:

Description: 
We have a _deterministic_ field in 
[Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115]
 to check if an expression is deterministic, but we do not have a similar field 
in 
[QueryPlan.|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44]

We have a need for such a check in the QueryPlan sometimes, like in 
[InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56]

This proposal is to add a _deterministic_ field to QueryPlan.

More details[ in this 
document|https://docs.google.com/document/d/1eIiaSJf-Co2HhjsaQxFNGwUxobnHID4ZGmJMcVytREc/edit#].

  was:
We have a _deterministic_ field in 
[Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115]
 to check if an expression is deterministic, but we do not have a similar field 
in 
[QueryPlan.|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44]

We have a need for such a check in the QueryPlan sometimes, like in 
[InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56]

This proposal is to add a _deterministic_ field to QueryPlan.

More details in this document.


> Add a deterministic field to QueryPlan
> --
>
> Key: SPARK-37199
> URL: https://issues.apache.org/jira/browse/SPARK-37199
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Abhishek Somani
>Priority: Major
>
> We have a _deterministic_ field in 
> [Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115]
>  to check if an expression is deterministic, but we do not have a similar 
> field in 
> [QueryPlan.|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44]
> We have a need for such a check in the QueryPlan sometimes, like in 
> [InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56]
> This proposal is to add a _deterministic_ field to QueryPlan.
> More details[ in this 
> document|https://docs.google.com/document/d/1eIiaSJf-Co2HhjsaQxFNGwUxobnHID4ZGmJMcVytREc/edit#].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37197) PySpark pandas recent issues from chconnell

2021-11-02 Thread Chuck Connell (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chuck Connell updated SPARK-37197:
--
Description: 
SPARK-37180  PySpark.pandas should support __version__

SPARK-37181  pyspark.pandas.read_csv() should support latin-1 encoding
  
 SPARK-37183  pyspark.pandas.DataFrame.map() should support .fillna()
  
 SPARK-37184  pyspark.pandas should support 
DF["column"].str.split("some_suffix").str[0]
 SPARK-37186  pyspark.pandas should support tseries.offsets 

SPARK-37187  pyspark.pandas fails to create a histogram of one column from a 
large DataFrame

SPARK-37188  pyspark.pandas histogram accepts the title option but does not add 
a title to the plot

SPARK-37189  pyspark.pandas histogram accepts the range option but does not use 
it

SPARK-37198  pyspark.pandas read_csv() and to_csv() should handle local files

 

  was:
SPARK-37180  PySpark.pandas should support __version__

SPARK-37181  pyspark.pandas.read_csv() should support latin-1 encoding
 
SPARK-37183  pyspark.pandas.DataFrame.map() should support .fillna()
 
SPARK-37184  pyspark.pandas should support 
DF["column"].str.split("some_suffix").str[0]
SPARK-37186  pyspark.pandas should support tseries.offsets 

SPARK-37187  pyspark.pandas fails to create a histogram of one column from a 
large DataFrame

SPARK-37188  pyspark.pandas histogram accepts the title option but does not add 
a title to the plot

SPARK-37189  pyspark.pandas histogram accepts the range option but does not use 
it

 


> PySpark pandas recent issues from chconnell
> ---
>
> Key: SPARK-37197
> URL: https://issues.apache.org/jira/browse/SPARK-37197
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Priority: Major
>
> SPARK-37180  PySpark.pandas should support __version__
> SPARK-37181  pyspark.pandas.read_csv() should support latin-1 encoding
>   
>  SPARK-37183  pyspark.pandas.DataFrame.map() should support .fillna()
>   
>  SPARK-37184  pyspark.pandas should support 
> DF["column"].str.split("some_suffix").str[0]
>  SPARK-37186  pyspark.pandas should support tseries.offsets 
> SPARK-37187  pyspark.pandas fails to create a histogram of one column from a 
> large DataFrame
> SPARK-37188  pyspark.pandas histogram accepts the title option but does not 
> add a title to the plot
> SPARK-37189  pyspark.pandas histogram accepts the range option but does not 
> use it
> SPARK-37198  pyspark.pandas read_csv() and to_csv() should handle local files
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37199) Add a deterministic field to QueryPlan

2021-11-02 Thread Abhishek Somani (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437514#comment-17437514
 ] 

Abhishek Somani commented on SPARK-37199:
-

Will add a PR soon.

> Add a deterministic field to QueryPlan
> --
>
> Key: SPARK-37199
> URL: https://issues.apache.org/jira/browse/SPARK-37199
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Abhishek Somani
>Priority: Major
>
> We have a _deterministic_ field in 
> [Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115]
>  to check if an expression is deterministic, but we do not have a similar 
> field in 
> [QueryPlan.|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44]
> We have a need for such a check in the QueryPlan sometimes, like in 
> [InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56]
> This proposal is to add a _deterministic_ field to QueryPlan.
> More details [in this 
> document|https://docs.google.com/document/d/1eIiaSJf-Co2HhjsaQxFNGwUxobnHID4ZGmJMcVytREc/edit#].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37198) pyspark.pandas read_csv() and to_csv() should handle local files

2021-11-02 Thread Chuck Connell (Jira)
Chuck Connell created SPARK-37198:
-

 Summary: pyspark.pandas read_csv() and to_csv() should handle 
local files 
 Key: SPARK-37198
 URL: https://issues.apache.org/jira/browse/SPARK-37198
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Chuck Connell


Pandas programmers who move their code to Spark would like to import and export 
text files to and from their local disk. I know there are technical hurdles to 
this (since Spark usually runs on a cluster that does not know where your local 
computer is), but it would really help code migration.

For read_csv() and to_csv(), the syntax {{*file://c:/Temp/my_file.csv*}} (or 
something like this) should import from and export to the local disk on Windows, 
and similarly for Mac and Linux.
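
As a point of comparison, Spark's plain DataFrame readers and writers already accept file:// URIs when the path is visible to the driver and executors (e.g. in local mode). A hedged sketch with a hypothetical path:

{code:java}
// Hedged sketch: only works where the local path is reachable by the processes
// doing the I/O (e.g. local mode); path is hypothetical.
val df = spark.read.option("header", "true").csv("file:///C:/Temp/my_file.csv")
df.write.mode("overwrite").csv("file:///C:/Temp/out")
{code}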



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37199) Add a deterministic field to QueryPlan

2021-11-02 Thread Abhishek Somani (Jira)
Abhishek Somani created SPARK-37199:
---

 Summary: Add a deterministic field to QueryPlan
 Key: SPARK-37199
 URL: https://issues.apache.org/jira/browse/SPARK-37199
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Abhishek Somani


We have a _deterministic_ field in 
[Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115]
 to check if an expression is deterministic, but we do not have a similar field 
in 
[QueryPlan.|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44]

We have a need for such a check in the QueryPlan sometimes, like in 
[InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56]

This proposal is to add a _deterministic_ field to QueryPlan.

More details in this document.
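
A hedged sketch of the idea follows; the class and field names are illustrative only, not Spark's actual QueryPlan API, and mirror how Expression.deterministic composes over children:

{code:java}
// Hedged, self-contained illustration (hypothetical names, not Spark's API):
// a plan node is deterministic iff all of its expressions and children are.
case class Expr(deterministic: Boolean)
case class PlanNode(expressions: Seq[Expr], children: Seq[PlanNode]) {
  lazy val deterministic: Boolean =
    expressions.forall(_.deterministic) && children.forall(_.deterministic)
}

// One non-deterministic expression makes the enclosing subtree non-deterministic.
val leaf = PlanNode(Seq(Expr(deterministic = true)), Nil)
val root = PlanNode(Seq(Expr(deterministic = false)), Seq(leaf))
assert(leaf.deterministic && !root.deterministic)
{code}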



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37199) Add a deterministic field to QueryPlan

2021-11-02 Thread Abhishek Somani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Somani updated SPARK-37199:

Description: 
We have a _deterministic_ field in 
[Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115]
 to check if an expression is deterministic, but we do not have a similar field 
in 
[QueryPlan.|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44]

We have a need for such a check in the QueryPlan sometimes, like in 
[InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56]

This proposal is to add a _deterministic_ field to QueryPlan.

More details [in this 
document|https://docs.google.com/document/d/1eIiaSJf-Co2HhjsaQxFNGwUxobnHID4ZGmJMcVytREc/edit#].

  was:
We have a _deterministic_ field in 
[Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115]
 to check if an expression is deterministic, but we do not have a similar field 
in 
[QueryPlan.|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44]

We have a need for such a check in the QueryPlan sometimes, like in 
[InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56]

This proposal is to add a _deterministic_ field to QueryPlan.

More details[ in this 
document|https://docs.google.com/document/d/1eIiaSJf-Co2HhjsaQxFNGwUxobnHID4ZGmJMcVytREc/edit#].


> Add a deterministic field to QueryPlan
> --
>
> Key: SPARK-37199
> URL: https://issues.apache.org/jira/browse/SPARK-37199
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Abhishek Somani
>Priority: Major
>
> We have a _deterministic_ field in 
> [Expressions|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L115]
>  to check if an expression is deterministic, but we do not have a similar 
> field in 
> [QueryPlan.|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L44]
> We have a need for such a check in the QueryPlan sometimes, like in 
> [InlineCTE|https://github.com/apache/spark/blob/b78167a2ee6b11b1f2839274e23676411f919115/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala#L56]
> This proposal is to add a _deterministic_ field to QueryPlan.
> More details [in this 
> document|https://docs.google.com/document/d/1eIiaSJf-Co2HhjsaQxFNGwUxobnHID4ZGmJMcVytREc/edit#].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35496) Upgrade Scala 2.13 to 2.13.7

2021-11-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35496:
--
Affects Version/s: (was: 3.2.0)

> Upgrade Scala 2.13 to 2.13.7
> 
>
> Key: SPARK-35496
> URL: https://issues.apache.org/jira/browse/SPARK-35496
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> This issue aims to upgrade to Scala 2.13.7.
> Scala 2.13.6 has been released (https://github.com/scala/scala/releases/tag/v2.13.6). 
> However, we skip 2.13.6 because it contains a breaking behavior change that 
> differs from both Scala 2.13.5 and Scala 3.
> - https://github.com/scala/bug/issues/12403
> {code}
> scala3-3.0.0:$ bin/scala
> scala> Array.empty[Double].intersect(Array(0.0))
> val res0: Array[Double] = Array()
> scala-2.13.6:$ bin/scala
> Welcome to Scala 2.13.6 (OpenJDK 64-Bit Server VM, Java 1.8.0_292).
> Type in expressions for evaluation. Or try :help.
> scala> Array.empty[Double].intersect(Array(0.0))
> java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [D
>   ... 32 elided
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37066) Improve ORC RecordReader's error message

2021-11-02 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-37066:


Assignee: angerszhu

> Improve ORC RecordReader's error message
> 
>
> Key: SPARK-37066
> URL: https://issues.apache.org/jira/browse/SPARK-37066
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Minor
>
> In the error message below, we don't know the actual file path:
> {code}
> 21/10/19 18:12:58 ERROR Executor: Exception in task 34.1 in stage 14.0 (TID 
> 257)
> java.lang.ArrayIndexOutOfBoundsException: 1024
>   at 
> org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:292)
>   at 
> org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.nextVector(TreeReaderFactory.java:1820)
>   at 
> org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1517)
>   at 
> org.apache.orc.impl.ConvertTreeReaderFactory$DateFromStringGroupTreeReader.nextVector(ConvertTreeReaderFactory.java:1802)
>   at 
> org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:2059)
>   at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1324)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:196)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:99)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
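
A hedged sketch of the general shape of such an improvement (not the actual fix in the pull request): wrap the per-file read so any exception surfaces the file that triggered it, e.g. the ArrayIndexOutOfBoundsException above.

{code:java}
// Hedged sketch only: attach the file path to exceptions thrown while reading.
def readWithFileContext[T](filePath: String)(read: => T): T =
  try read catch {
    case e: Exception =>
      throw new RuntimeException(s"Failed to read file: $filePath", e)
  }
{code}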



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37066) Improve ORC RecordReader's error message

2021-11-02 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-37066.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34337
[https://github.com/apache/spark/pull/34337]

> Improve ORC RecordReader's error message
> 
>
> Key: SPARK-37066
> URL: https://issues.apache.org/jira/browse/SPARK-37066
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Minor
> Fix For: 3.3.0
>
>
> In the error message below, we don't know the actual file path:
> {code}
> 21/10/19 18:12:58 ERROR Executor: Exception in task 34.1 in stage 14.0 (TID 
> 257)
> java.lang.ArrayIndexOutOfBoundsException: 1024
>   at 
> org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:292)
>   at 
> org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.nextVector(TreeReaderFactory.java:1820)
>   at 
> org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1517)
>   at 
> org.apache.orc.impl.ConvertTreeReaderFactory$DateFromStringGroupTreeReader.nextVector(ConvertTreeReaderFactory.java:1802)
>   at 
> org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:2059)
>   at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1324)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:196)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:99)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
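
Until the message carries the path, a hedged debugging sketch for narrowing down the offending file; the warehouse directory, and the use of the built-in noop sink to force a full scan, are assumptions rather than anything from this ticket (an active SparkSession named spark is assumed, as in pyspark):
{code:python}
import glob

# Scan the ORC files one by one to find which file triggers the
# ArrayIndexOutOfBoundsException, since the current stack trace has no path.
# The directory below is hypothetical.
for path in sorted(glob.glob("/warehouse/mydb.db/mytable/*.orc")):
    try:
        # Force a full read of every column of just this file.
        spark.read.orc(path).write.format("noop").mode("overwrite").save()
        print(f"ok: {path}")
    except Exception as exc:
        print(f"failed while reading {path}: {exc}")
{code}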



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37066) Improve ORC RecordReader's error message

2021-11-02 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-37066:
-
Priority: Minor  (was: Major)

> Improve ORC RecordReader's error message
> 
>
> Key: SPARK-37066
> URL: https://issues.apache.org/jira/browse/SPARK-37066
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Minor
>
> In the error message below, we don't know the actual file path:
> {code}
> 21/10/19 18:12:58 ERROR Executor: Exception in task 34.1 in stage 14.0 (TID 
> 257)
> java.lang.ArrayIndexOutOfBoundsException: 1024
>   at 
> org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:292)
>   at 
> org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.nextVector(TreeReaderFactory.java:1820)
>   at 
> org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1517)
>   at 
> org.apache.orc.impl.ConvertTreeReaderFactory$DateFromStringGroupTreeReader.nextVector(ConvertTreeReaderFactory.java:1802)
>   at 
> org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:2059)
>   at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1324)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:196)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:99)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
>   at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-11-02 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437394#comment-17437394
 ] 

Nicholas Chammas edited comment on SPARK-24853 at 11/2/21, 2:41 PM:


[~hyukjin.kwon] - It's not just for consistency. As noted in the description, 
this is useful when you are trying to rename a column with an ambiguous name.

For example, imagine two tables {{left}} and {{right}}, each with a column 
called {{count}}:
{code:python}
(
  left_counts.alias('left')
  .join(right_counts.alias('right'), on='join_key')
  .withColumn(
'total_count',
left_counts['count'] + right_counts['count']
  )
  .withColumnRenamed('left.count', 'left_count')  # no-op; alias doesn't work
  .withColumnRenamed('count', 'left_count')  # incorrect; it renames both count 
columns
  .withColumnRenamed(left_counts['count'], 'left_count')  # what, ideally, 
users want to do here
  .show()
){code}
If you don't mind, I'm going to reopen this issue.


was (Author: nchammas):
[~hyukjin.kwon] - It's not just for consistency. As noted in the description, 
this is useful when you are trying to rename a column with an ambiguous name.

For example, imagine two tables {{left}} and {{right}}, each with a column 
called {{count}}:
{code:java}
(
  left_counts.alias('left')
  .join(right_counts.alias('right'), on='join_key')
  .withColumn(
'total_count',
left_counts['count'] + right_counts['count']
  )
  .withColumnRenamed('left.count', 'left_count')  # no-op; alias doesn't work
  .withColumnRenamed('count', 'left_count')  # incorrect; it renames both count 
columns
  .withColumnRenamed(left_counts['count'], 'left_count')  # what, ideally, 
users want to do here
  .show()
){code}
If you don't mind, I'm going to reopen this issue.

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 3.2.0
>Reporter: nirav patel
>Priority: Minor
>
> Can we add an overloaded version of withColumn or withColumnRenamed that 
> accepts a Column instead of a String? That way I can specify the fully 
> qualified name when there are duplicate column names, e.g. if I have 2 columns 
> with the same name as a result of a join and I want to rename one of the 
> fields, I can do it with this new API.
>  
> This would be similar to the drop API, which supports both String and Column 
> types.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-11-02 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437396#comment-17437396
 ] 

Nicholas Chammas commented on SPARK-24853:
--

The [contributing guide|https://spark.apache.org/contributing.html] isn't clear 
on how to populate "Affects Version" for improvements, so I've just tagged the 
latest release.

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 3.2.0
>Reporter: nirav patel
>Priority: Minor
>
> Can we add an overloaded version of withColumn or withColumnRenamed that 
> accepts a Column instead of a String? That way I can specify the fully 
> qualified name when there are duplicate column names, e.g. if I have 2 columns 
> with the same name as a result of a join and I want to rename one of the 
> fields, I can do it with this new API.
>  
> This would be similar to the drop API, which supports both String and Column 
> types.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-11-02 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-24853:
-
Priority: Minor  (was: Major)

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2
>Reporter: nirav patel
>Priority: Minor
>
> Can we add an overloaded version of withColumn or withColumnRenamed that 
> accepts a Column instead of a String? That way I can specify the fully 
> qualified name when there are duplicate column names, e.g. if I have 2 columns 
> with the same name as a result of a join and I want to rename one of the 
> fields, I can do it with this new API.
>  
> This would be similar to the drop API, which supports both String and Column 
> types.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-11-02 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas reopened SPARK-24853:
--

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 3.2.0
>Reporter: nirav patel
>Priority: Minor
>
> Can we add an overloaded version of withColumn or withColumnRenamed that 
> accepts a Column instead of a String? That way I can specify the fully 
> qualified name when there are duplicate column names, e.g. if I have 2 columns 
> with the same name as a result of a join and I want to rename one of the 
> fields, I can do it with this new API.
>  
> This would be similar to the drop API, which supports both String and Column 
> types.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-11-02 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437394#comment-17437394
 ] 

Nicholas Chammas commented on SPARK-24853:
--

[~hyukjin.kwon] - It's not just for consistency. As noted in the description, 
this is useful when you are trying to rename a column with an ambiguous name.

For example, imagine two tables {{left}} and {{right}}, each with a column 
called {{count}}:
{code:java}
(
  left_counts.alias('left')
  .join(right_counts.alias('right'), on='join_key')
  .withColumn(
'total_count',
left_counts['count'] + right_counts['count']
  )
  .withColumnRenamed('left.count', 'left_count')  # no-op; alias doesn't work
  .withColumnRenamed('count', 'left_count')  # incorrect; it renames both count 
columns
  .withColumnRenamed(left_counts['count'], 'left_count')  # what, ideally, 
users want to do here
  .show()
){code}
If you don't mind, I'm going to reopen this issue.
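
For reference, a hedged sketch of a workaround that is possible with today's API (not the proposed overload); it reuses the DataFrame names from the example above and resolves each ambiguous column through its parent DataFrame:
{code:python}
# Resolve each ambiguous "count" column through the DataFrame it came from,
# then rename it with select/alias instead of withColumnRenamed.
joined = left_counts.join(right_counts, on="join_key")
resolved = joined.select(
    "join_key",
    left_counts["count"].alias("left_count"),
    right_counts["count"].alias("right_count"),
)
resolved = resolved.withColumn(
    "total_count", resolved["left_count"] + resolved["right_count"]
)
resolved.show()
{code}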

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2
>Reporter: nirav patel
>Priority: Major
>
> Can we add an overloaded version of withColumn or withColumnRenamed that 
> accepts a Column instead of a String? That way I can specify the fully 
> qualified name when there are duplicate column names, e.g. if I have 2 columns 
> with the same name as a result of a join and I want to rename one of the 
> fields, I can do it with this new API.
>  
> This would be similar to the drop API, which supports both String and Column 
> types.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis

2021-11-02 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-24853:
-
Affects Version/s: 3.2.0

> Support Column type for withColumn and withColumnRenamed apis
> -
>
> Key: SPARK-24853
> URL: https://issues.apache.org/jira/browse/SPARK-24853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.2, 3.2.0
>Reporter: nirav patel
>Priority: Minor
>
> Can we add an overloaded version of withColumn or withColumnRenamed that 
> accepts a Column instead of a String? That way I can specify the fully 
> qualified name when there are duplicate column names, e.g. if I have 2 columns 
> with the same name as a result of a join and I want to rename one of the 
> fields, I can do it with this new API.
>  
> This would be similar to the drop API, which supports both String and Column 
> types.
>  
> def
> withColumn(colName: Column, col: Column): DataFrame
> Returns a new Dataset by adding a column or replacing the existing column 
> that has the same name.
>  
> def
> withColumnRenamed(existingName: Column, newName: Column): DataFrame
> Returns a new Dataset with a column renamed.
>  
>  
>  
> I think there should also be this one:
>  
> def
> withColumnRenamed(existingName: *Column*, newName: *Column*): DataFrame
> Returns a new Dataset with a column renamed.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37189) pyspark.pandas histogram accepts the range option but does not use it

2021-11-02 Thread Chuck Connell (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437329#comment-17437329
 ] 

Chuck Connell edited comment on SPARK-37189 at 11/2/21, 1:18 PM:
-

Ok, will do. 


was (Author: chconnell):
Ok, will do. FYI, getting covid shot today, so I may be tired for a few days.

> pyspark.pandas histogram accepts the range option but does not use it
> -
>
> Key: SPARK-37189
> URL: https://issues.apache.org/jira/browse/SPARK-37189
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Priority: Major
>
> In pyspark.pandas if you write a line like this
> {quote}DF.plot.hist(bins=30, range=[0, 20], title="US Counties -- 
> DeathsPer100k (<20)")
> {quote}
> it compiles and runs, but the plot does not respect the range. All the values 
> are shown.
> The workaround is to create a new DataFrame that pre-selects just the rows 
> you want, but the line above should work too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37197) PySpark pandas recent issues from chconnell

2021-11-02 Thread Chuck Connell (Jira)
Chuck Connell created SPARK-37197:
-

 Summary: PySpark pandas recent issues from chconnell
 Key: SPARK-37197
 URL: https://issues.apache.org/jira/browse/SPARK-37197
 Project: Spark
  Issue Type: Umbrella
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Chuck Connell


SPARK-37180  PySpark.pandas should support __version__

SPARK-37181  pyspark.pandas.read_csv() should support latin-1 encoding
 
SPARK-37183  pyspark.pandas.DataFrame.map() should support .fillna()
 
SPARK-37184  pyspark.pandas should support 
DF["column"].str.split("some_suffix").str[0]
SPARK-37186  pyspark.pandas should support tseries.offsets 

SPARK-37187  pyspark.pandas fails to create a histogram of one column from a 
large DataFrame

SPARK-37188  pyspark.pandas histogram accepts the title option but does not add 
a title to the plot

SPARK-37189  pyspark.pandas histogram accepts the range option but does not use 
it

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37189) pyspark.pandas histogram accepts the range option but does not use it

2021-11-02 Thread Chuck Connell (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437329#comment-17437329
 ] 

Chuck Connell commented on SPARK-37189:
---

Ok, will do. FYI, getting covid shot today, so I may be tired for a few days.

> pyspark.pandas histogram accepts the range option but does not use it
> -
>
> Key: SPARK-37189
> URL: https://issues.apache.org/jira/browse/SPARK-37189
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Priority: Major
>
> In pyspark.pandas if you write a line like this
> {quote}DF.plot.hist(bins=30, range=[0, 20], title="US Counties -- 
> DeathsPer100k (<20)")
> {quote}
> it compiles and runs, but the plot does not respect the range. All the values 
> are shown.
> The workaround is to create a new DataFrame that pre-selects just the rows 
> you want, but the line above should work too.
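
A minimal sketch of that pre-filtering workaround, assuming a pandas-on-Spark DataFrame named DF and a hypothetical DeathsPer100k column taken from the example title:
{code:python}
# Pre-select the rows in the desired range, then plot only that subset.
subset = DF[DF["DeathsPer100k"] < 20]
subset["DeathsPer100k"].plot.hist(bins=30)
{code}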



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23977) Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism

2021-11-02 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437259#comment-17437259
 ] 

Steve Loughran commented on SPARK-23977:


[~gumartinm] can I draw your attention to Apache Iceberg?

meanwhile
MAPREDUCE-7341 adds a high-performance committer targeting abfs and gcs; all tree 
scanning is in task commit, which is atomic; job commit is aggressively 
parallelised and optimized for stores whose listStatusIterator calls are 
incremental with prefetching: we can start processing the first page of task 
manifests found in a listing while the second page is still being retrieved. 
Also in there: rate limiting and IO statistics collection.

HADOOP-17833 I will pick up some of that work, including incremental loading 
and rate limiting. And if we can keep reads and writes below the S3 IOPS 
limits, we should be able to avoid situations where we have to start sleeping 
and re-trying.

HADOOP-17981 is my homework this week: emergency work to deal with a rare but 
current failure in abfs under heavy load.
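
For anyone wanting to try the binding described here, a minimal PySpark sketch, assuming the spark-hadoop-cloud module is on the classpath; the property and class names follow the Spark cloud-integration documentation and should be verified against your Spark/Hadoop versions:
{code:python}
from pyspark.sql import SparkSession

# Bind Spark's commit protocol to a PathOutputCommitter for s3a.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.committer.name", "directory")
    .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
            "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    .getOrCreate()
)
{code}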


> Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
> ---
>
> Key: SPARK-23977
> URL: https://issues.apache.org/jira/browse/SPARK-23977
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 3.0.0
>
>
> Hadoop 3.1 adds a mechanism for job-specific and store-specific committers 
> (MAPREDUCE-6823, MAPREDUCE-6956), and one key implementation, S3A committers, 
> HADOOP-13786
> These committers deliver high-performance output of MR and spark jobs to S3, 
> and offer the key semantics which Spark depends on: no visible output until 
> job commit, a failure of a task at an stage, including partway through task 
> commit, can be handled by executing and committing another task attempt. 
> In contrast, the FileOutputFormat commit algorithms on S3 have issues:
> * Awful performance because files are copied by rename
> * FileOutputFormat v1: weak task commit failure recovery semantics as the 
> (v1) expectation: "directory renames are atomic" doesn't hold.
> * S3 metadata eventual consistency can cause rename to miss files or fail 
> entirely (SPARK-15849)
> Note also that FileOutputFormat "v2" commit algorithm doesn't offer any of 
> the commit semantics w.r.t observability of or recovery from task commit 
> failure, on any filesystem.
> The S3A committers address these by way of uploading all data to the 
> destination through multipart uploads, uploads which are only completed in 
> job commit.
> The new {{PathOutputCommitter}} factory mechanism allows applications to work 
> with the S3A committers and any other, by adding a plugin mechanism into the 
> MRv2 FileOutputFormat class, where it job config and filesystem configuration 
> options can dynamically choose the output committer.
> Spark can use these with some binding classes to 
> # Add a subclass of {{HadoopMapReduceCommitProtocol}} which uses the MRv2 
> classes and {{PathOutputCommitterFactory}} to create the committers.
> # Add a {{BindingParquetOutputCommitter extends ParquetOutputCommitter}}
> to wire up Parquet output even when code requires the committer to be a 
> subclass of {{ParquetOutputCommitter}}
> This patch builds on SPARK-23807 for setting up the dependencies.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37196) NPE in org.apache.spark.sql.hive.HiveShim$.toCatalystDecimal(HiveShim.scala:106)

2021-11-02 Thread Sergey Yavorski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Yavorski updated SPARK-37196:

Description: 
Still reproducible in Spark 3.0.1.

+*How to reproduce:*+


*SHELL*

> export SPARK_MAJOR_VERSION=3

*SPARK PART 1*

> spark-shell --master local
SPARK_MAJOR_VERSION is set to 3, using Spark3
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
21/10/25 11:43:58 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
Attempting port 4041.
21/10/25 11:43:58 WARN Utils: Service 'SparkUI' could not bind on port 4041. 
Attempting port 4042.
Spark context Web UI available at :4042
Spark context available as 'sc' (master = local, app id = local-1635151438558).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_302)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = spark.sql("select 'dummy' as name, 
100010.7010 as value")
df: org.apache.spark.sql.DataFrame = [name: string, value: decimal(38,16)]

scala> df.write.mode("Overwrite").parquet("test.parquet")

*HIVE*

hive> create external table test_precision(name string, value Decimal(18,6)) 
STORED AS PARQUET LOCATION 'test';
OK
Time taken: 0.067 seconds

*SPARK PART 2*
   
scala> spark.conf.set("spark.sql.hive.convertMetastoreParquet","false");

scala> val df_hive = spark.sql("select * from test_precision");
df_hive: org.apache.spark.sql.DataFrame = [name: string, value: decimal(18,6)]

scala> df_hive.show;
21/10/25 12:04:51 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)/ 1]
*{color:red}java.lang.NullPointerException
at 
org.apache.spark.sql.hive.HiveShim$.toCatalystDecimal(HiveShim.scala:106){color}*
at 
org.apache.spark.sql.hive.HadoopTableReader$.$anonfun$fillObject$13(TableReader.scala:465)
at 
org.apache.spark.sql.hive.HadoopTableReader$.$anonfun$fillObject$13$adapted(TableReader.scala:464)
at 
org.apache.spark.sql.hive.HadoopTableReader$.$anonfun$fillObject$18(TableReader.scala:493)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
21/10/25 12:04:51 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, 
tklis-kappd0005.dev.df.sbrf.ru, executor driver): java.lang.NullPointerException
at 
org.apache.spark.sql.hive.HiveShim$.toCatalystDecimal(HiveShim.scala:106)
at 
org.apache.spark.sql.hive.HadoopTableReader$.$anonfun$fillObject$13(TableReader.scala:465)
at 
org.apache.spark.sql.hive.HadoopTableReader$.$anonfun$fillObject$13$adapted(TableReader.scala:464)
at 
org.apache.spark.sql.hive.HadoopTableReader$.$anonfun$fillObject$18(TableReader.scala:493)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 

[jira] [Created] (SPARK-37196) NPE in org.apache.spark.sql.hive.HiveShim$.toCatalystDecimal(HiveShim.scala:106)

2021-11-02 Thread Sergey Yavorski (Jira)
Sergey Yavorski created SPARK-37196:
---

 Summary: NPE in 
org.apache.spark.sql.hive.HiveShim$.toCatalystDecimal(HiveShim.scala:106)
 Key: SPARK-37196
 URL: https://issues.apache.org/jira/browse/SPARK-37196
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 3.0.1
Reporter: Sergey Yavorski


This is still reproducible in Spark 3.0.1:




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37176) JsonSource's infer should have the same exception handle logic as JacksonParser's parse logic

2021-11-02 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-37176.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34455
[https://github.com/apache/spark/pull/34455]

> JsonSource's infer should have the same exception handle logic as 
> JacksonParser's parse logic
> -
>
> Key: SPARK-37176
> URL: https://issues.apache.org/jira/browse/SPARK-37176
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Xianjin YE
>Assignee: Xianjin YE
>Priority: Minor
> Fix For: 3.3.0
>
>
> JacksonParser's exception handling logic is different from 
> org.apache.spark.sql.catalyst.json.JsonInferSchema#infer's logic; the 
> difference can be seen below:
> {code:java}
> // code JacksonParser's parse
> try {
>   Utils.tryWithResource(createParser(factory, record)) { parser =>
> // a null first token is equivalent to testing for input.trim.isEmpty
> // but it works on any token stream and not just strings
> parser.nextToken() match {
>   case null => None
>   case _ => rootConverter.apply(parser) match {
> case null => throw 
> QueryExecutionErrors.rootConverterReturnNullError()
> case rows => rows.toSeq
>   }
> }
>   }
> } catch {
>   case e: SparkUpgradeException => throw e
>   case e @ (_: RuntimeException | _: JsonProcessingException | _: 
> MalformedInputException) =>
> // JSON parser currently doesn't support partial results for 
> corrupted records.
> // For such records, all fields other than the field configured by
> // `columnNameOfCorruptRecord` are set to `null`.
> throw BadRecordException(() => recordLiteral(record), () => None, e)
>   case e: CharConversionException if options.encoding.isEmpty =>
> val msg =
>   """JSON parser cannot handle a character in its input.
> |Specifying encoding as an input option explicitly might help to 
> resolve the issue.
> |""".stripMargin + e.getMessage
> val wrappedCharException = new CharConversionException(msg)
> wrappedCharException.initCause(e)
> throw BadRecordException(() => recordLiteral(record), () => None, 
> wrappedCharException)
>   case PartialResultException(row, cause) =>
> throw BadRecordException(
>   record = () => recordLiteral(record),
>   partialResult = () => Some(row),
>   cause)
> }
> {code}
> v.s. 
> {code:java}
> // JsonInferSchema's infer logic
> val mergedTypesFromPartitions = json.mapPartitions { iter =>
>   val factory = options.buildJsonFactory()
>   iter.flatMap { row =>
> try {
>   Utils.tryWithResource(createParser(factory, row)) { parser =>
> parser.nextToken()
> Some(inferField(parser))
>   }
> } catch {
>   case  e @ (_: RuntimeException | _: JsonProcessingException) => 
> parseMode match {
> case PermissiveMode =>
>   Some(StructType(Seq(StructField(columnNameOfCorruptRecord, 
> StringType
> case DropMalformedMode =>
>   None
> case FailFastMode =>
>   throw 
> QueryExecutionErrors.malformedRecordsDetectedInSchemaInferenceError(e)
>   }
> }
>   }.reduceOption(typeMerger).toIterator
> }
> {code}
> They should have the same exception handling logic; otherwise it may confuse 
> users because of the inconsistency.
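
A small illustration of where the two code paths meet, using made-up records rather than anything from this ticket; inference (JsonInferSchema.infer) runs first, then each record goes through JacksonParser.parse:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One well-formed and one malformed record (illustrative data only).
records = ['{"a": 1}', '{"a": ']
rdd = spark.sparkContext.parallelize(records)

df = spark.read.option("mode", "PERMISSIVE").json(rdd)
df.printSchema()            # expect a _corrupt_record column for the malformed line
df.show(truncate=False)
{code}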



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37176) JsonSource's infer should have the same exception handle logic as JacksonParser's parse logic

2021-11-02 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-37176:


Assignee: Xianjin YE

> JsonSource's infer should have the same exception handle logic as 
> JacksonParser's parse logic
> -
>
> Key: SPARK-37176
> URL: https://issues.apache.org/jira/browse/SPARK-37176
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: Xianjin YE
>Assignee: Xianjin YE
>Priority: Minor
>
> JacksonParser's exception handling logic is different from 
> org.apache.spark.sql.catalyst.json.JsonInferSchema#infer's logic; the 
> difference can be seen below:
> {code:java}
> // code JacksonParser's parse
> try {
>   Utils.tryWithResource(createParser(factory, record)) { parser =>
> // a null first token is equivalent to testing for input.trim.isEmpty
> // but it works on any token stream and not just strings
> parser.nextToken() match {
>   case null => None
>   case _ => rootConverter.apply(parser) match {
> case null => throw 
> QueryExecutionErrors.rootConverterReturnNullError()
> case rows => rows.toSeq
>   }
> }
>   }
> } catch {
>   case e: SparkUpgradeException => throw e
>   case e @ (_: RuntimeException | _: JsonProcessingException | _: 
> MalformedInputException) =>
> // JSON parser currently doesn't support partial results for 
> corrupted records.
> // For such records, all fields other than the field configured by
> // `columnNameOfCorruptRecord` are set to `null`.
> throw BadRecordException(() => recordLiteral(record), () => None, e)
>   case e: CharConversionException if options.encoding.isEmpty =>
> val msg =
>   """JSON parser cannot handle a character in its input.
> |Specifying encoding as an input option explicitly might help to 
> resolve the issue.
> |""".stripMargin + e.getMessage
> val wrappedCharException = new CharConversionException(msg)
> wrappedCharException.initCause(e)
> throw BadRecordException(() => recordLiteral(record), () => None, 
> wrappedCharException)
>   case PartialResultException(row, cause) =>
> throw BadRecordException(
>   record = () => recordLiteral(record),
>   partialResult = () => Some(row),
>   cause)
> }
> {code}
> v.s. 
> {code:java}
> // JsonInferSchema's infer logic
> val mergedTypesFromPartitions = json.mapPartitions { iter =>
>   val factory = options.buildJsonFactory()
>   iter.flatMap { row =>
> try {
>   Utils.tryWithResource(createParser(factory, row)) { parser =>
> parser.nextToken()
> Some(inferField(parser))
>   }
> } catch {
>   case  e @ (_: RuntimeException | _: JsonProcessingException) => 
> parseMode match {
> case PermissiveMode =>
>   Some(StructType(Seq(StructField(columnNameOfCorruptRecord, 
> StringType
> case DropMalformedMode =>
>   None
> case FailFastMode =>
>   throw 
> QueryExecutionErrors.malformedRecordsDetectedInSchemaInferenceError(e)
>   }
> }
>   }.reduceOption(typeMerger).toIterator
> }
> {code}
> They should have the same exception handling logic; otherwise it may confuse 
> users because of the inconsistency.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37194) Avoid unnecessary sort in FileFormatWriter if it's not dynamic partition

2021-11-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37194:


Assignee: (was: Apache Spark)

> Avoid unnecessary sort in FileFormatWriter if it's not dynamic partition
> 
>
> Key: SPARK-37194
> URL: https://issues.apache.org/jira/browse/SPARK-37194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> `FileFormatWriter.write` sorts the partition and bucket columns before 
> writing. I think this code path assumed the input `partitionColumns` are 
> dynamic, but actually they are not always. It is now used by three code paths:
>  - `FileStreamSink`; its partitions should always be dynamic
>  - `SaveAsHiveFile`; it follows the assumption that `InsertIntoHiveTable` has 
> removed the static partitions and `InsertIntoHiveDirCommand` has no partitions
>  - `InsertIntoHadoopFsRelationCommand`; it passes `partitionColumns` into 
> `FileFormatWriter.write` without removing the static partitions, because we 
> need them to generate the partition path in `DynamicPartitionDataWriter`
> This shows that the unnecessary sort only affects 
> `InsertIntoHadoopFsRelationCommand` when we write data with a static partition.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37194) Avoid unnecessary sort in FileFormatWriter if it's not dynamic partition

2021-11-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37194:


Assignee: Apache Spark

> Avoid unnecessary sort in FileFormatWriter if it's not dynamic partition
> 
>
> Key: SPARK-37194
> URL: https://issues.apache.org/jira/browse/SPARK-37194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Major
>
> `FileFormatWriter.write` sorts the partition and bucket columns before 
> writing. I think this code path assumed the input `partitionColumns` are 
> dynamic, but actually they are not always. It is now used by three code paths:
>  - `FileStreamSink`; its partitions should always be dynamic
>  - `SaveAsHiveFile`; it follows the assumption that `InsertIntoHiveTable` has 
> removed the static partitions and `InsertIntoHiveDirCommand` has no partitions
>  - `InsertIntoHadoopFsRelationCommand`; it passes `partitionColumns` into 
> `FileFormatWriter.write` without removing the static partitions, because we 
> need them to generate the partition path in `DynamicPartitionDataWriter`
> This shows that the unnecessary sort only affects 
> `InsertIntoHadoopFsRelationCommand` when we write data with a static partition.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37194) Avoid unnecessary sort in FileFormatWriter if it's not dynamic partition

2021-11-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437226#comment-17437226
 ] 

Apache Spark commented on SPARK-37194:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/34468

> Avoid unnecessary sort in FileFormatWriter if it's not dynamic partition
> 
>
> Key: SPARK-37194
> URL: https://issues.apache.org/jira/browse/SPARK-37194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> `FileFormatWriter.write` sorts the partition and bucket columns before 
> writing. I think this code path assumed the input `partitionColumns` are 
> dynamic, but actually they are not always. It is now used by three code paths:
>  - `FileStreamSink`; its partitions should always be dynamic
>  - `SaveAsHiveFile`; it follows the assumption that `InsertIntoHiveTable` has 
> removed the static partitions and `InsertIntoHiveDirCommand` has no partitions
>  - `InsertIntoHadoopFsRelationCommand`; it passes `partitionColumns` into 
> `FileFormatWriter.write` without removing the static partitions, because we 
> need them to generate the partition path in `DynamicPartitionDataWriter`
> This shows that the unnecessary sort only affects 
> `InsertIntoHadoopFsRelationCommand` when we write data with a static partition.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36895) Add Create Index syntax support

2021-11-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437216#comment-17437216
 ] 

Apache Spark commented on SPARK-36895:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/34467

> Add Create Index syntax support
> ---
>
> Key: SPARK-36895
> URL: https://issues.apache.org/jira/browse/SPARK-36895
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37034) What's the progress of vectorized execution for spark?

2021-11-02 Thread xiaoli (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437213#comment-17437213
 ] 

xiaoli commented on SPARK-37034:


[~cloud_fan] Thanks very much. The hint in your answer helped me find the 
ApplyColumnarRulesAndInsertTransitions class in Spark, so I can do more 
investigation now.

> What's the progress of vectorized execution for spark?
> --
>
> Key: SPARK-37034
> URL: https://issues.apache.org/jira/browse/SPARK-37034
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: xiaoli
>Priority: Major
>
> Spark supports vectorized reads for ORC and Parquet. What's the progress on 
> other vectorized execution, e.g. vectorized writes, joins, aggregations, and 
> simple operators (string functions, math functions)? 
> Hive has supported vectorized execution since an early version 
> (https://cwiki.apache.org/confluence/display/hive/vectorized+query+execution). 
> As we know, Spark is a replacement for Hive. I guess the reason Spark does not 
> support vectorized execution may be that the design or implementation 
> difficulty in Spark is larger. What's the main issue for Spark to support 
> vectorized execution?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37195) Unify v1 and v2 SHOW TBLPROPERTIES tests

2021-11-02 Thread PengLei (Jira)
PengLei created SPARK-37195:
---

 Summary: Unify v1 and v2 SHOW TBLPROPERTIES  tests
 Key: SPARK-37195
 URL: https://issues.apache.org/jira/browse/SPARK-37195
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: PengLei
 Fix For: 3.3.0


Unify v1 and v2 SHOW TBLPROPERTIES tests



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36924) CAST between ANSI intervals and numerics

2021-11-02 Thread PengLei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437198#comment-17437198
 ] 

PengLei commented on SPARK-36924:
-

Working on this later.

> CAST between ANSI intervals and numerics
> 
>
> Key: SPARK-36924
> URL: https://issues.apache.org/jira/browse/SPARK-36924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Support casting between ANSI intervals and numerics. The implementation 
> should follow ANSI SQL standard. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37194) Avoid unnecessary sort in FileFormatWriter if it's not dynamic partition

2021-11-02 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-37194:
--
Description: 
`FileFormatWriter.write` sorts the partition and bucket columns before 
writing. I think this code path assumed the input `partitionColumns` are 
dynamic, but actually they are not always. It is now used by three code paths:
 - `FileStreamSink`; its partitions should always be dynamic
 - `SaveAsHiveFile`; it follows the assumption that `InsertIntoHiveTable` has 
removed the static partitions and `InsertIntoHiveDirCommand` has no partitions
 - `InsertIntoHadoopFsRelationCommand`; it passes `partitionColumns` into 
`FileFormatWriter.write` without removing the static partitions, because we 
need them to generate the partition path in `DynamicPartitionDataWriter`

This shows that the unnecessary sort only affects 
`InsertIntoHadoopFsRelationCommand` when we write data with a static partition.

 

  was:
`FileFormatWriter.write` will sort the partition and bucket column before 
writing. I think this code path assumed the input `partitionColumns` are 
dynamic but actually it's not. It now is used by three code path:
 - `FileStreamSink`; it should be always dynamic partition
 - `SaveAsHiveFile`; it followed the assuming that `InsertIntoHiveTable` has 
removed the static partition and `InsertIntoHiveDirCommand` has no partition
 - `InsertIntoHadoopFsRelationCommand`; it passed `partitionColumns` into 
`FileFormatWriter.write` without removing static partition because we need it 
to generate the partition path in `DynamicPartitionDataWriter`

It shows that the unnecessary sort only affected the 
`InsertIntoHadoopFsRelationCommand` if we write data with static partition.

 

Do a simple benchmak:
{code:java}
CREATE TABLE test (id long) USING PARQUET PARTITIONED BY (d string);

-- before this PR, it tooks 1.82 seconds
-- after this PR, it tooks 1.072 seconds
INSERT OVERWRITE TABLE test PARTITION(d='a') SELECT id FROM range(1000);
{code}


> Avoid unnecessary sort in FileFormatWriter if it's not dynamic partition
> 
>
> Key: SPARK-37194
> URL: https://issues.apache.org/jira/browse/SPARK-37194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> `FileFormatWriter.write` sorts the partition and bucket columns before 
> writing. I think this code path assumed the input `partitionColumns` are 
> dynamic, but actually they are not always. It is now used by three code paths:
>  - `FileStreamSink`; its partitions should always be dynamic
>  - `SaveAsHiveFile`; it follows the assumption that `InsertIntoHiveTable` has 
> removed the static partitions and `InsertIntoHiveDirCommand` has no partitions
>  - `InsertIntoHadoopFsRelationCommand`; it passes `partitionColumns` into 
> `FileFormatWriter.write` without removing the static partitions, because we 
> need them to generate the partition path in `DynamicPartitionDataWriter`
> This shows that the unnecessary sort only affects 
> `InsertIntoHadoopFsRelationCommand` when we write data with a static partition.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code

2021-11-02 Thread Oscar Bonilla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437197#comment-17437197
 ] 

Oscar Bonilla commented on SPARK-26365:
---

Hi [~Gangishetty], unfortunately I'm not a contributor to the Apache Spark 
project. I only reported the issue because I found it.

You'll have to follow the usual channels if you want this issue prioritized, 
although I assume it's probably already been fixed in the 3.x branch.

> spark-submit for k8s cluster doesn't propagate exit code
> 
>
> Key: SPARK-26365
> URL: https://issues.apache.org/jira/browse/SPARK-26365
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core, Spark Submit
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Oscar Bonilla
>Priority: Minor
> Attachments: spark-2.4.5-raise-exception-k8s-failure.patch, 
> spark-3.0.0-raise-exception-k8s-failure.patch
>
>
> When launching apps using spark-submit in a Kubernetes cluster, if the Spark 
> application fails (returns exit code = 1, for example), spark-submit will 
> still exit gracefully and return exit code = 0.
> This is problematic, since there's no way to know if there's been a problem 
> with the Spark application.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37194) Avoid unnecessary sort in FileFormatWriter if it's not dynamic partition

2021-11-02 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-37194:
--
Description: 
`FileFormatWriter.write` sorts the partition and bucket columns before 
writing. I think this code path assumed the input `partitionColumns` are 
dynamic, but actually they are not always. It is now used by three code paths:
 - `FileStreamSink`; its partitions should always be dynamic
 - `SaveAsHiveFile`; it follows the assumption that `InsertIntoHiveTable` has 
removed the static partitions and `InsertIntoHiveDirCommand` has no partitions
 - `InsertIntoHadoopFsRelationCommand`; it passes `partitionColumns` into 
`FileFormatWriter.write` without removing the static partitions, because we 
need them to generate the partition path in `DynamicPartitionDataWriter`

This shows that the unnecessary sort only affects 
`InsertIntoHadoopFsRelationCommand` when we write data with a static partition.

 

Do a simple benchmark:
{code:java}
CREATE TABLE test (id long) USING PARQUET PARTITIONED BY (d string);

-- before this PR, it took 1.82 seconds
-- after this PR, it took 1.072 seconds
INSERT OVERWRITE TABLE test PARTITION(d='a') SELECT id FROM range(1000);
{code}

  was:
`FileFormatWriter.write` will sort the partition and bucket column before do 
writing. I think this code path assumed the input `partitionColumns` are 
dynamic but actually it's not. It now is used by three code path:
 - `FileStreamSink`; it should be always dynamic partition
 - `SaveAsHiveFile`; it followed the assuming that `InsertIntoHiveTable` has 
removed the static partition and `InsertIntoHiveDirCommand` has no partition
 - `InsertIntoHadoopFsRelationCommand`; it passed `partitionColumns` into 
`FileFormatWriter.write` without removing static partition because we need it 
to generate the partition path in `DynamicPartitionDataWriter`

It shows that the unnecessary sort only affected the 
`InsertIntoHadoopFsRelationCommand` if we write data with static partition.

 

Do a simple benchmak:
{code:java}
CREATE TABLE test (id long) USING PARQUET PARTITIONED BY (d string);

-- before this PR, it tooks 1.82 seconds
-- after this PR, it tooks 1.072 seconds
INSERT OVERWRITE TABLE test PARTITION(d='a') SELECT id FROM range(1000);
{code}


> Avoid unnecessary sort in FileFormatWriter if it's not dynamic partition
> 
>
> Key: SPARK-37194
> URL: https://issues.apache.org/jira/browse/SPARK-37194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> `FileFormatWriter.write` sorts the partition and bucket columns before 
> writing. I think this code path assumed the input `partitionColumns` are 
> dynamic, but actually they are not always. It is now used by three code paths:
>  - `FileStreamSink`; its partitions should always be dynamic
>  - `SaveAsHiveFile`; it follows the assumption that `InsertIntoHiveTable` has 
> removed the static partitions and `InsertIntoHiveDirCommand` has no partitions
>  - `InsertIntoHadoopFsRelationCommand`; it passes `partitionColumns` into 
> `FileFormatWriter.write` without removing the static partitions, because we 
> need them to generate the partition path in `DynamicPartitionDataWriter`
> This shows that the unnecessary sort only affects 
> `InsertIntoHadoopFsRelationCommand` when we write data with a static partition.
>  
> Do a simple benchmark:
> {code:java}
> CREATE TABLE test (id long) USING PARQUET PARTITIONED BY (d string);
> -- before this PR, it took 1.82 seconds
> -- after this PR, it took 1.072 seconds
> INSERT OVERWRITE TABLE test PARTITION(d='a') SELECT id FROM range(1000);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37194) Avoid unnecessary sort in FileFormatWriter if it's not dynamic partition

2021-11-02 Thread XiDuo You (Jira)
XiDuo You created SPARK-37194:
-

 Summary: Avoid unnecessary sort in FileFormatWriter if it's not 
dynamic partition
 Key: SPARK-37194
 URL: https://issues.apache.org/jira/browse/SPARK-37194
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: XiDuo You


`FileFormatWriter.write` sorts the partition and bucket columns before 
writing. I think this code path assumed the input `partitionColumns` are 
dynamic, but actually they are not always. It is now used by three code paths:
 - `FileStreamSink`; its partitions should always be dynamic
 - `SaveAsHiveFile`; it follows the assumption that `InsertIntoHiveTable` has 
removed the static partitions and `InsertIntoHiveDirCommand` has no partitions
 - `InsertIntoHadoopFsRelationCommand`; it passes `partitionColumns` into 
`FileFormatWriter.write` without removing the static partitions, because we 
need them to generate the partition path in `DynamicPartitionDataWriter`

This shows that the unnecessary sort only affects 
`InsertIntoHadoopFsRelationCommand` when we write data with a static partition.

 

A simple benchmark:
{code:java}
CREATE TABLE test (id long) USING PARQUET PARTITIONED BY (d string);

-- before this PR, it took 1.82 seconds
-- after this PR, it took 1.072 seconds
INSERT OVERWRITE TABLE test PARTITION(d='a') SELECT id FROM range(1000);
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37174) WARN WindowExec: No Partition Defined is being printed 4 times.

2021-11-02 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437184#comment-17437184
 ] 

Hyukjin Kwon edited comment on SPARK-37174 at 11/2/21, 7:57 AM:


This is related to the default index; see also 
https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/options.html#default-index-type.

Spark 3.3 aims to remove such warnings by natively supporting global 
windows. That's slightly orthogonal to this issue, though.
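
For anyone hitting the warning before 3.3, a minimal sketch of switching the default index type so that no global window is needed (assumes pandas API on Spark is available as {{ps}}):
{code:java}
import pyspark.pandas as ps

# The default "sequence" index is computed with a global (unpartitioned)
# window, which triggers the "No Partition Defined" warning. A distributed
# index avoids it, at the cost of non-consecutive index values.
ps.set_option("compute.default_index_type", "distributed")

psdf = ps.range(10)  # no global-window warning expected with this index type
{code}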


was (Author: hyukjin.kwon):
This is related to default index, see also 
https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/options.html#default-index-type.
Spark 3.3 targets to remove such warnings.

> WARN WindowExec: No Partition Defined is being printed 4 times. 
> 
>
> Key: SPARK-37174
> URL: https://issues.apache.org/jira/browse/SPARK-37174
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> Hi, I use this code:
> {code:java}
> f01 = spark.read.json("/home/test_files/falk/flatted110721/F01.json/*.json")
> pf01 = f01.to_pandas_on_spark()
> pf01 = pf01.rename(columns=lambda x: re.sub(':P$', '', x))
> pf01["OBJECT_CONTRACT:DATE_PUBLICATION_NOTICE"] = 
> ps.to_datetime(pf01["OBJECT_CONTRACT:DATE_PUBLICATION_NOTICE"])
> pf01.info(){code}
>  
>  sometimes it prints 
>   
> {code:java}
>  21/10/31 20:38:04 WARN WindowExec: No Partition Defined for Window 
> operation! Moving all data to a single partition, this can cause serious 
> performance degradation.
>  21/10/31 20:38:04 WARN package: Truncated the string representation of a 
> plan since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
>  21/10/31 20:38:08 WARN WindowExec: No Partition Defined for Window 
> operation! Moving all data to a single partition, this can cause serious 
> performance degradation.
>  /opt/spark/python/pyspark/sql/pandas/conversion.py:214: PerformanceWarning: 
> DataFrame is highly fragmented.  This is usually the result of calling 
> `frame.insert` many times, which has poor performance.  Consider joining all 
> columns at once using pd.concat(axis=1) instead.  To get a de-fragmented 
> frame, use `newframe = frame.copy()`
>    df[column_name] = series
>  /opt/spark/python/pyspark/pandas/utils.py:967: UserWarning: `to_pandas` 
> loads all data into the driver's memory. It should only be used if the 
> resulting pandas Series is expected to be small.
>    warnings.warn(message, UserWarning)
>  21/10/31 20:38:16 WARN WindowExec: No Partition Defined for Window 
> operation! Moving all data to a single partition, this can cause serious 
> performance degradation.
>  21/10/31 20:38:18 WARN WindowExec: No Partition Defined for Window 
> operation! Moving all data to a single partition, this can cause serious 
> performance degradation.{code}
>  
>  and some other times it "just" prints 
>   
> {code:java}
>  21/10/31 21:24:13 WARN WindowExec: No Partition Defined for Window 
> operation! Moving all data to a single partition, this can cause serious 
> performance degradation.
>  21/10/31 21:24:16 WARN WindowExec: No Partition Defined for Window 
> operation! Moving all data to a single partition, this can cause serious 
> performance degradation.
>  21/10/31 21:24:22 WARN WindowExec: No Partition Defined for Window 
> operation! Moving all data to a single partition, this can cause serious 
> performance degradation.
>  21/10/31 21:24:24 WARN WindowExec: No Partition Defined for Window 
> operation! Moving all data to a single partition, this can cause serious 
> performance degradation.{code}
> Why does it print {{df[column_name] = series}}?
>
>  Can we remove the {{/opt/spark/python/pyspark/pandas/utils.py:967:}} line, the 
> {{warnings.warn(message, UserWarning)}} line, and three of the "WARN WindowExec: 
> No Partition Defined for Window operation! Moving all data to a single 
> partition, this can cause serious performance degradation." warnings?
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37174) WARN WindowExec: No Partition Defined is being printed 4 times.

2021-11-02 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437184#comment-17437184
 ] 

Hyukjin Kwon commented on SPARK-37174:
--

This is related to the default index; see also 
https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/options.html#default-index-type.
Spark 3.3 aims to remove such warnings.

> WARN WindowExec: No Partition Defined is being printed 4 times. 
> 
>
> Key: SPARK-37174
> URL: https://issues.apache.org/jira/browse/SPARK-37174
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> Hi, I use this code:
> {code:java}
> f01 = spark.read.json("/home/test_files/falk/flatted110721/F01.json/*.json")
> pf01 = f01.to_pandas_on_spark()
> pf01 = pf01.rename(columns=lambda x: re.sub(':P$', '', x))
> pf01["OBJECT_CONTRACT:DATE_PUBLICATION_NOTICE"] = 
> ps.to_datetime(pf01["OBJECT_CONTRACT:DATE_PUBLICATION_NOTICE"])
> pf01.info(){code}
>  
>  sometimes it prints 
>   
> {code:java}
>  21/10/31 20:38:04 WARN WindowExec: No Partition Defined for Window 
> operation! Moving all data to a single partition, this can cause serious 
> performance degradation.
>  21/10/31 20:38:04 WARN package: Truncated the string representation of a 
> plan since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
>  21/10/31 20:38:08 WARN WindowExec: No Partition Defined for Window 
> operation! Moving all data to a single partition, this can cause serious 
> performance degradation.
>  /opt/spark/python/pyspark/sql/pandas/conversion.py:214: PerformanceWarning: 
> DataFrame is highly fragmented.  This is usually the result of calling 
> `frame.insert` many times, which has poor performance.  Consider joining all 
> columns at once using pd.concat(axis=1) instead.  To get a de-fragmented 
> frame, use `newframe = frame.copy()`
>    df[column_name] = series
>  /opt/spark/python/pyspark/pandas/utils.py:967: UserWarning: `to_pandas` 
> loads all data into the driver's memory. It should only be used if the 
> resulting pandas Series is expected to be small.
>    warnings.warn(message, UserWarning)
>  21/10/31 20:38:16 WARN WindowExec: No Partition Defined for Window 
> operation! Moving all data to a single partition, this can cause serious 
> performance degradation.
>  21/10/31 20:38:18 WARN WindowExec: No Partition Defined for Window 
> operation! Moving all data to a single partition, this can cause serious 
> performance degradation.{code}
>  
>  and some other times it "just" prints 
>   
> {code:java}
>  21/10/31 21:24:13 WARN WindowExec: No Partition Defined for Window 
> operation! Moving all data to a single partition, this can cause serious 
> performance degradation.
>  21/10/31 21:24:16 WARN WindowExec: No Partition Defined for Window 
> operation! Moving all data to a single partition, this can cause serious 
> performance degradation.
>  21/10/31 21:24:22 WARN WindowExec: No Partition Defined for Window 
> operation! Moving all data to a single partition, this can cause serious 
> performance degradation.
>  21/10/31 21:24:24 WARN WindowExec: No Partition Defined for Window 
> operation! Moving all data to a single partition, this can cause serious 
> performance degradation.{code}
> Why does it print {{df[column_name] = series}}?
>
>  Can we remove the {{/opt/spark/python/pyspark/pandas/utils.py:967:}} line, the 
> {{warnings.warn(message, UserWarning)}} line, and three of the "WARN WindowExec: 
> No Partition Defined for Window operation! Moving all data to a single 
> partition, this can cause serious performance degradation." warnings?
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37177) Support LONG argument to the Spark SQL LIMIT clause

2021-11-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37177:
-
Component/s: (was: Spark Core)
 SQL

> Support LONG argument to the Spark SQL LIMIT clause
> ---
>
> Key: SPARK-37177
> URL: https://issues.apache.org/jira/browse/SPARK-37177
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Douglas Moore
>Priority: Major
>
> Big data sets may exceed INT max, thus the LIMIT clause in Spark SQL `LIMIT 
> ` should be expanded to support . 
> Then
>  `SELECT * FROM  LIMIT 200` would run and not return an error.
>  
> See docs: 
> [https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-limit.html]
>   
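
A minimal reproduction sketch in PySpark (the exact error text may differ by version):
{code:java}
# A LIMIT value that fits in an INT works today:
spark.sql("SELECT * FROM range(10) LIMIT 5").show()

# A literal above Int.MaxValue is parsed as a BIGINT and is rejected, because
# the limit expression currently must be of integer type:
spark.sql("SELECT * FROM range(10) LIMIT 3000000000").show()  # AnalysisException
{code}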



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37185) DataFrame.take() only uses one worker

2021-11-02 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437182#comment-17437182
 ] 

Hyukjin Kwon commented on SPARK-37185:
--

Can you show the perf diff between the two code paths?
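
For reference, a minimal way to capture that difference (assumes the {{df}} from the description; wall-clock only, so it is worth running a few times):
{code:java}
import time

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")

# Compare the DataFrame limit path against the RDD path from the report.
timed("df.limit(10).collect()", lambda: df.limit(10).collect())
timed("df.limit(10).rdd.collect()", lambda: df.limit(10).rdd.collect())
{code}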

> DataFrame.take() only uses one worker
> -
>
> Key: SPARK-37185
> URL: https://issues.apache.org/jira/browse/SPARK-37185
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.0
> Environment: CentOS 7
>Reporter: mathieu longtin
>Priority: Major
>
> Say you have a query:
> {code:java}
> >>> df = spark.sql("select * from mytable where x = 99"){code}
> Now, out of billions of rows, there are only ten rows where x is 99.
> If I do:
> {code:java}
> >>> df.limit(10).collect()
> [Stage 1:>  (0 + 1) / 1]{code}
> It only uses one worker. This takes a really long time since one CPU is 
> reading the billions of rows.
> However, if I do this:
> {code:java}
> >>> df.limit(10).rdd.collect()
> [Stage 1:>  (0 + 10) / 22]{code}
> All the workers are running.
> I think there's some optimization issue in DataFrame.take(...).
> This did not use to be an issue, but I'm not sure if it was working with 3.0 
> or 2.4.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37185) DataFrame.take() only uses one worker

2021-11-02 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437180#comment-17437180
 ] 

Hyukjin Kwon commented on SPARK-37185:
--

Isn't it more optimized to use only one partition on one worker if less data is 
required?

> DataFrame.take() only uses one worker
> -
>
> Key: SPARK-37185
> URL: https://issues.apache.org/jira/browse/SPARK-37185
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.0
> Environment: CentOS 7
>Reporter: mathieu longtin
>Priority: Major
>
> Say you have a query:
> {code:java}
> >>> df = spark.sql("select * from mytable where x = 99"){code}
> Now, out of billions of rows, there are only ten rows where x is 99.
> If I do:
> {code:java}
> >>> df.limit(10).collect()
> [Stage 1:>  (0 + 1) / 1]{code}
> It only uses one worker. This takes a really long time since one CPU is 
> reading the billions of rows.
> However, if I do this:
> {code:java}
> >>> df.limit(10).rdd.collect()
> [Stage 1:>  (0 + 10) / 22]{code}
> All the workers are running.
> I think there's some optimization issue in DataFrame.take(...).
> This did not use to be an issue, but I'm not sure if it was working with 3.0 
> or 2.4.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37189) pyspark.pandas histogram accepts the range option but does not use it

2021-11-02 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437179#comment-17437179
 ] 

Hyukjin Kwon commented on SPARK-37189:
--

Hey, thanks for reporting issues. [~chconnell], would you mind creating an 
umbrella JIRA and grouping all the JIRAs you opened? cc [~ueshin] [~itholic] 
[~XinrongM] FYI

> pyspark.pandas histogram accepts the range option but does not use it
> -
>
> Key: SPARK-37189
> URL: https://issues.apache.org/jira/browse/SPARK-37189
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Priority: Major
>
> In pyspark.pandas if you write a line like this
> {quote}DF.plot.hist(bins=30, range=[0, 20], title="US Counties -- 
> DeathsPer100k (<20)")
> {quote}
> it compiles and runs, but the plot does not respect the range. All the values 
> are shown.
> The workaround is to create a new DataFrame that pre-selects just the rows 
> you want, but the line above should work as well.
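
The workaround described above, as a minimal sketch (the column name is illustrative):
{code:java}
# Pre-select the rows in the desired range, since the range option is
# currently ignored by the plot:
subset = DF[(DF["DeathsPer100k"] >= 0) & (DF["DeathsPer100k"] < 20)]
subset.plot.hist(bins=30, title="US Counties -- DeathsPer100k (<20)")
{code}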



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37030) Maven build failed in windows!

2021-11-02 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437178#comment-17437178
 ] 

Hyukjin Kwon commented on SPARK-37030:
--

You need {{bash}} on the PATH on Windows, either from Cygwin or from Windows' 
native bash support.

> Maven build failed in windows!
> --
>
> Key: SPARK-37030
> URL: https://issues.apache.org/jira/browse/SPARK-37030
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
> Environment: OS: Windows 10 Professional
> OS Version: 21H1
> Maven Version: 3.6.3
>  
>Reporter: Shockang
>Priority: Minor
> Fix For: 3.2.0
>
> Attachments: image-2021-10-17-22-18-16-616.png
>
>
> I pulled the latest Spark master code on my local Windows 10 computer and 
> executed the following command:
> {code:java}
> mvn -DskipTests clean install{code}
> Build failed!
> !image-2021-10-17-22-18-16-616.png!
> {code:java}
> Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.8:run 
> (default) on project spark-core_2.12: An Ant BuildException has occured: 
> Execute failed: java.io.IOException: Cannot run program "bash" (in directory 
> "C:\bigdata\spark\core"): CreateProcess error=2{code}
> It seems that the maven-antrun-plugin plugin cannot run because Windows has 
> no bash. 
> The following snippet comes from the pom.xml of the spark-core module.
> {code:java}
> <plugin>
>   <groupId>org.apache.maven.plugins</groupId>
>   <artifactId>maven-antrun-plugin</artifactId>
>   <executions>
>     <execution>
>       <phase>generate-resources</phase>
>       <configuration>
>         <target>
>           <!-- exec of the bash build script; element details were stripped in the original message -->
>         </target>
>       </configuration>
>       <goals>
>         <goal>run</goal>
>       </goals>
>     </execution>
>   </executions>
> </plugin>
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Subscribe

2021-11-02 Thread XING JIN


-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37156) Inline type hints for python/pyspark/storagelevel.py

2021-11-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37156.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34437
[https://github.com/apache/spark/pull/34437]
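
For context, "inline" here means moving the annotations from the {{.pyi}} stub into the {{.py}} module itself; a rough sketch of the idea on a StorageLevel-style class (simplified, not the actual diff):
{code:java}
class StorageLevel:
    """Flags controlling how an RDD is stored (simplified sketch)."""

    def __init__(
        self,
        useDisk: bool,
        useMemory: bool,
        useOffHeap: bool,
        deserialized: bool,
        replication: int = 1,
    ) -> None:
        self.useDisk = useDisk
        self.useMemory = useMemory
        self.useOffHeap = useOffHeap
        self.deserialized = deserialized
        self.replication = replication

    def __repr__(self) -> str:
        return (
            f"StorageLevel({self.useDisk}, {self.useMemory}, "
            f"{self.useOffHeap}, {self.deserialized}, {self.replication})"
        )
{code}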

> Inline type hints for python/pyspark/storagelevel.py
> 
>
> Key: SPARK-37156
> URL: https://issues.apache.org/jira/browse/SPARK-37156
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Byron Hsu
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37156) Inline type hints for python/pyspark/storagelevel.py

2021-11-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37156:


Assignee: Apache Spark

> Inline type hints for python/pyspark/storagelevel.py
> 
>
> Key: SPARK-37156
> URL: https://issues.apache.org/jira/browse/SPARK-37156
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Byron Hsu
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37030) Maven build failed in windows!

2021-11-02 Thread Shockang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437168#comment-17437168
 ] 

Shockang commented on SPARK-37030:
--

[~hyukjin.kwon] Even if the community does not support Spark on Windows, why does 
no one care about programmers who use Windows ...

> Maven build failed in windows!
> --
>
> Key: SPARK-37030
> URL: https://issues.apache.org/jira/browse/SPARK-37030
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
> Environment: OS: Windows 10 Professional
> OS Version: 21H1
> Maven Version: 3.6.3
>  
>Reporter: Shockang
>Priority: Minor
> Fix For: 3.2.0
>
> Attachments: image-2021-10-17-22-18-16-616.png
>
>
> I pulled the latest Spark master code on my local Windows 10 computer and 
> executed the following command:
> {code:java}
> mvn -DskipTests clean install{code}
> Build failed!
> !image-2021-10-17-22-18-16-616.png!
> {code:java}
> Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.8:run 
> (default) on project spark-core_2.12: An Ant BuildException has occured: 
> Execute failed: java.io.IOException: Cannot run program "bash" (in directory 
> "C:\bigdata\spark\core"): CreateProcess error=2{code}
> It seems that the maven-antrun-plugin plugin cannot run because Windows has 
> no bash. 
> The following snippet comes from the pom.xml of the spark-core module.
> {code:java}
> <plugin>
>   <groupId>org.apache.maven.plugins</groupId>
>   <artifactId>maven-antrun-plugin</artifactId>
>   <executions>
>     <execution>
>       <phase>generate-resources</phase>
>       <configuration>
>         <target>
>           <!-- exec of the bash build script; element details were stripped in the original message -->
>         </target>
>       </configuration>
>       <goals>
>         <goal>run</goal>
>       </goals>
>     </execution>
>   </executions>
> </plugin>
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37152) Inline type hints for python/pyspark/context.py

2021-11-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437161#comment-17437161
 ] 

Apache Spark commented on SPARK-37152:
--

User 'ByronHsu' has created a pull request for this issue:
https://github.com/apache/spark/pull/34466

> Inline type hints for python/pyspark/context.py
> ---
>
> Key: SPARK-37152
> URL: https://issues.apache.org/jira/browse/SPARK-37152
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Byron Hsu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37152) Inline type hints for python/pyspark/context.py

2021-11-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437159#comment-17437159
 ] 

Apache Spark commented on SPARK-37152:
--

User 'ByronHsu' has created a pull request for this issue:
https://github.com/apache/spark/pull/34466

> Inline type hints for python/pyspark/context.py
> ---
>
> Key: SPARK-37152
> URL: https://issues.apache.org/jira/browse/SPARK-37152
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Byron Hsu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37152) Inline type hints for python/pyspark/context.py

2021-11-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37152:


Assignee: Apache Spark

> Inline type hints for python/pyspark/context.py
> ---
>
> Key: SPARK-37152
> URL: https://issues.apache.org/jira/browse/SPARK-37152
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Byron Hsu
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37152) Inline type hints for python/pyspark/context.py

2021-11-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37152:


Assignee: (was: Apache Spark)

> Inline type hints for python/pyspark/context.py
> ---
>
> Key: SPARK-37152
> URL: https://issues.apache.org/jira/browse/SPARK-37152
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Byron Hsu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37168) Improve error messages for SQL functions and operators under ANSI mode

2021-11-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37168:
---

Assignee: Allison Wang

> Improve error messages for SQL functions and operators under ANSI mode
> --
>
> Key: SPARK-37168
> URL: https://issues.apache.org/jira/browse/SPARK-37168
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>
> Make error messages more actionable when ANSI mode is enabled.
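
For reference, a minimal sketch of the kind of failure whose message this ticket targets (the exact wording is what the change improves, so it will vary):
{code:java}
# With ANSI mode enabled, invalid casts and arithmetic overflows raise errors
# instead of returning NULL; the ticket is about making those errors actionable.
spark.conf.set("spark.sql.ansi.enabled", "true")

spark.sql("SELECT CAST('abc' AS INT)").show()  # runtime error under ANSI mode
spark.sql("SELECT 2147483647 + 1").show()      # integer overflow also raises
{code}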



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37168) Improve error messages for SQL functions and operators under ANSI mode

2021-11-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37168.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34443
[https://github.com/apache/spark/pull/34443]

> Improve error messages for SQL functions and operators under ANSI mode
> --
>
> Key: SPARK-37168
> URL: https://issues.apache.org/jira/browse/SPARK-37168
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.3.0
>
>
> Make error messages more actionable when ANSI mode is enabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37164) Add ExpressionBuilder for functions with complex overloads

2021-11-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37164:
---

Assignee: Wenchen Fan

> Add ExpressionBuilder for functions with complex overloads
> --
>
> Key: SPARK-37164
> URL: https://issues.apache.org/jira/browse/SPARK-37164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37164) Add ExpressionBuilder for functions with complex overloads

2021-11-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37164.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34441
[https://github.com/apache/spark/pull/34441]

> Add ExpressionBuilder for functions with complex overloads
> --
>
> Key: SPARK-37164
> URL: https://issues.apache.org/jira/browse/SPARK-37164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org