[jira] [Created] (SPARK-45510) Replace `scala.collection.generic.Growable` with `scala.collection.mutable.Growable`

2023-10-11 Thread Jia Fan (Jira)
Jia Fan created SPARK-45510:
---

 Summary: Replace `scala.collection.generic.Growable` with 
`scala.collection.mutable.Growable`
 Key: SPARK-45510
 URL: https://issues.apache.org/jira/browse/SPARK-45510
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jia Fan


Replace `scala.collection.generic.Growable` with 
`scala.collection.mutable.Growable`
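
For illustration, a minimal sketch of the kind of change involved (the 
BufferWriter class below is illustrative, not Spark code):
{code:java}
// before (deprecated alias in Scala 2.13): import scala.collection.generic.Growable
import scala.collection.mutable.Growable

// An accumulator-like helper that appends into any Growable sink.
class BufferWriter(sink: Growable[String]) {
  def write(line: String): Unit = sink += line
}

val buf = scala.collection.mutable.ArrayBuffer.empty[String]
new BufferWriter(buf).write("hello")
{code}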



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45424) Regression in CSV schema inference when timestamps do not match specified timestampFormat

2023-10-05 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17772442#comment-17772442
 ] 

Jia Fan commented on SPARK-45424:
-

Thanks [~andygrove], I found the cause of the bug. Let me create a PR for 
this.

> Regression in CSV schema inference when timestamps do not match specified 
> timestampFormat
> -
>
> Key: SPARK-45424
> URL: https://issues.apache.org/jira/browse/SPARK-45424
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Andy Grove
>Priority: Major
>
> There is a regression in Spark 3.5.0 when inferring the schema of CSV files 
> containing timestamps, where a column will be inferred as a timestamp even if 
> the contents do not match the specified timestampFormat.
> *Test Data*
> I have the following CSV file:
> {code:java}
> 2884-06-24T02:45:51.138
> 2884-06-24T02:45:51.138
> 2884-06-24T02:45:51.138
> {code}
> *Spark 3.4.0 Behavior (correct)*
> In Spark 3.4.0, if I specify the correct timestamp format, then the schema is 
> inferred as timestamp:
> {code:java}
> scala> val df = spark.read.option("timestampFormat", 
> "-MM-dd'T'HH:mm:ss.SSS").option("inferSchema", 
> true).csv("/tmp/timestamps.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: timestamp]
> {code}
> If I specify an incompatible timestampFormat, then the schema is inferred as 
> string:
> {code:java}
> scala> val df = spark.read.option("timestampFormat", 
> "-MM-dd'T'HH:mm:ss").option("inferSchema", 
> true).csv("/tmp/timestamps.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string]
> {code}
> *Spark 3.5.0*
> In Spark 3.5.0, the column will be inferred as timestamp even if the data 
> does not match the specified timestampFormat.
> {code:java}
> scala> val df = spark.read.option("timestampFormat", 
> "-MM-dd'T'HH:mm:ss").option("inferSchema", 
> true).csv("/tmp/timestamps.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: timestamp]
> {code}
> Reading the DataFrame then results in an error:
> {code:java}
> Caused by: java.time.format.DateTimeParseException: Text 
> '2884-06-24T02:45:51.138' could not be parsed, unparsed text found at index 19
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45433) CSV/JSON schema inference when timestamps do not match specified timestampFormat with only one row on each partition report error

2023-10-05 Thread Jia Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jia Fan updated SPARK-45433:

Description: 
CSV/JSON schema inference reports an error when the timestamps do not match the 
specified timestampFormat and there is only one row on each partition.
{code:java}
//eg
val csv = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
  .option("inferSchema", true).csv(Seq("2884-06-24T02:45:51.138").toDS())
csv.show() {code}
{code:java}
//error
Caused by: java.time.format.DateTimeParseException: Text 
'2884-06-24T02:45:51.138' could not be parsed, unparsed text found at index 19 
{code}
This bug affects 3.3/3.4/3.5. Unlike 
https://issues.apache.org/jira/browse/SPARK-45424, this is a different bug that 
happens to produce the same error message.

  was:
CSV/JSON schema inference reports an error when the timestamps do not match the 
specified timestampFormat and there is only one row on each partition.

 
{code:java}
//eg
val csv = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
  .option("inferSchema", true).csv(Seq("2884-06-24T02:45:51.138").toDS())
csv.show() {code}
{code:java}
//error
Caused by: java.time.format.DateTimeParseException: Text 
'2884-06-24T02:45:51.138' could not be parsed, unparsed text found at index 19 
{code}
This bug affects 3.3/3.4/3.5. Unlike 
https://issues.apache.org/jira/browse/SPARK-45424, this is a different bug that 
happens to produce the same error message.


> CSV/JSON schema inference when timestamps do not match specified 
> timestampFormat with only one row on each partition report error
> -
>
> Key: SPARK-45433
> URL: https://issues.apache.org/jira/browse/SPARK-45433
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
>Reporter: Jia Fan
>Priority: Major
>
> CSV/JSON schema inference reports an error when the timestamps do not match 
> the specified timestampFormat and there is only one row on each partition.
> {code:java}
> //eg
> val csv = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
>   .option("inferSchema", true).csv(Seq("2884-06-24T02:45:51.138").toDS())
> csv.show() {code}
> {code:java}
> //error
> Caused by: java.time.format.DateTimeParseException: Text 
> '2884-06-24T02:45:51.138' could not be parsed, unparsed text found at index 
> 19 {code}
> This bug affects 3.3/3.4/3.5. Unlike 
> https://issues.apache.org/jira/browse/SPARK-45424, this is a different bug 
> that happens to produce the same error message.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45433) CSV/JSON schema inference when timestamps do not match specified timestampFormat with only one row on each partition report error

2023-10-05 Thread Jia Fan (Jira)
Jia Fan created SPARK-45433:
---

 Summary: CSV/JSON schema inference when timestamps do not match 
specified timestampFormat with only one row on each partition report error
 Key: SPARK-45433
 URL: https://issues.apache.org/jira/browse/SPARK-45433
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0, 3.4.0, 3.3.0
Reporter: Jia Fan


CSV/JSON schema inference reports an error when the timestamps do not match the 
specified timestampFormat and there is only one row on each partition.

 
{code:java}
//eg
val csv = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
  .option("inferSchema", true).csv(Seq("2884-06-24T02:45:51.138").toDS())
csv.show() {code}
{code:java}
//error
Caused by: java.time.format.DateTimeParseException: Text 
'2884-06-24T02:45:51.138' could not be parsed, unparsed text found at index 19 
{code}
This bug affects 3.3/3.4/3.5. Unlike 
https://issues.apache.org/jira/browse/SPARK-45424, this is a different bug that 
happens to produce the same error message.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45413) Add warning to prepare for dropping LevelDB support

2023-10-04 Thread Jia Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jia Fan updated SPARK-45413:

Summary: Add warning to prepare for dropping LevelDB support  (was: Drop leveldb 
support for `spark.history.store.hybridStore.diskBackend`)

> Add warning to prepare for dropping LevelDB support
> 
>
> Key: SPARK-45413
> URL: https://issues.apache.org/jira/browse/SPARK-45413
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Jia Fan
>Priority: Major
>  Labels: pull-request-available
>
> Remove leveldb support for `spark.history.store.hybridStore.diskBackend`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44223) Drop leveldb support

2023-10-04 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17771824#comment-17771824
 ] 

Jia Fan commented on SPARK-44223:
-

A ticket to drop leveldb support for `spark.shuffle.service.db.backend` will be 
created after 4.0.0 is released.

> Drop leveldb support
> 
>
> Key: SPARK-44223
> URL: https://issues.apache.org/jira/browse/SPARK-44223
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>
> The leveldb project seems to be no longer maintained, and we can always 
> replace it with rocksdb. I think we can remove support and dependencies on 
> leveldb in Spark 4.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45413) Drop leveldb support for `spark.history.store.hybridStore.diskBackend`

2023-10-04 Thread Jia Fan (Jira)
Jia Fan created SPARK-45413:
---

 Summary: Drop leveldb support for 
`spark.history.store.hybridStore.diskBackend`
 Key: SPARK-45413
 URL: https://issues.apache.org/jira/browse/SPARK-45413
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Jia Fan


Remove leveldb support for `spark.history.store.hybridStore.diskBackend`
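
For illustration, a minimal sketch of selecting the remaining backend, assuming 
the History Server picks this key up from spark-defaults.conf / SparkConf:
{code:java}
val conf = new org.apache.spark.SparkConf()
  .set("spark.history.store.hybridStore.diskBackend", "ROCKSDB")
{code}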



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45351) Change RocksDB as default shuffle service db backend

2023-09-26 Thread Jia Fan (Jira)
Jia Fan created SPARK-45351:
---

 Summary: Change RocksDB as default shuffle service db backend
 Key: SPARK-45351
 URL: https://issues.apache.org/jira/browse/SPARK-45351
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Jia Fan


Change the default shuffle service DB backend to RocksDB, because leveldb will be 
removed in the future.
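
For illustration, a minimal sketch of what users currently have to set explicitly 
and what would become the default (assuming the shuffle service reads this key 
from its Spark configuration):
{code:java}
val conf = new org.apache.spark.SparkConf()
  .set("spark.shuffle.service.db.backend", "ROCKSDB")
{code}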



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45338) Remove scala.collection.JavaConverters

2023-09-26 Thread Jia Fan (Jira)
Jia Fan created SPARK-45338:
---

 Summary: Remove scala.collection.JavaConverters
 Key: SPARK-45338
 URL: https://issues.apache.org/jira/browse/SPARK-45338
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Affects Versions: 4.0.0
Reporter: Jia Fan


Remove the deprecated scala.collection.JavaConverters and replace it with 
scala.jdk.CollectionConverters
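
A minimal sketch of the migration (the values below are illustrative):
{code:java}
// before (deprecated in Scala 2.13): import scala.collection.JavaConverters._
import scala.jdk.CollectionConverters._

val javaList: java.util.List[String] = java.util.Arrays.asList("a", "b")
val scalaSeq: Seq[String] = javaList.asScala.toSeq  // same .asScala / .asJava extension methods
val backToJava: java.util.List[String] = scalaSeq.asJava
{code}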



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45116) Add some comment for param of JdbcDialect createTable

2023-09-11 Thread Jia Fan (Jira)
Jia Fan created SPARK-45116:
---

 Summary: Add some comment for param of JdbcDialect createTable
 Key: SPARK-45116
 URL: https://issues.apache.org/jira/browse/SPARK-45116
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.1
Reporter: Jia Fan


SPARK-41516 added {{createTable}} to {{JdbcDialect}}, but did not add comments 
for its parameters.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45075) Alter table with invalid default value will not report error

2023-09-04 Thread Jia Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jia Fan updated SPARK-45075:

Summary: Alter table with invalid default value will not report error  
(was: Alter table with invaild default value will not report error)

> Alter table with invalid default value will not report error
> 
>
> Key: SPARK-45075
> URL: https://issues.apache.org/jira/browse/SPARK-45075
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Jia Fan
>Priority: Major
>
> create table t(i boolean, s bigint);
> alter table t alter column s set default badvalue;
>  
> The code doesn't report an error on DataSource V2, which is not aligned with V1.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45075) Alter table with invaild default value will not report error

2023-09-04 Thread Jia Fan (Jira)
Jia Fan created SPARK-45075:
---

 Summary: Alter table with invaild default value will not report 
error
 Key: SPARK-45075
 URL: https://issues.apache.org/jira/browse/SPARK-45075
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.1, 3.5.0
Reporter: Jia Fan


create table t(i boolean, s bigint);
alter table t alter column s set default badvalue;
 
The code doesn't report an error on DataSource V2, which is not aligned with V1.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45059) Add try_reflect to Scala and Python

2023-09-01 Thread Jia Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jia Fan updated SPARK-45059:

Summary: Add try_reflect to Scala and Python  (was: Add to_reflect to Scala 
and Python)

> Add try_reflect to Scala and Python
> ---
>
> Key: SPARK-45059
> URL: https://issues.apache.org/jira/browse/SPARK-45059
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Jia Fan
>Priority: Major
>
> Add `try_reflect` to Spark Connect and PySpark



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45059) Add to_reflect to Scala and Python

2023-09-01 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17761434#comment-17761434
 ] 

Jia Fan commented on SPARK-45059:
-

I'm working on it

> Add to_reflect to Scala and Python
> --
>
> Key: SPARK-45059
> URL: https://issues.apache.org/jira/browse/SPARK-45059
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Jia Fan
>Priority: Major
>
> Add `try_reflect` to Spark Connect and PySpark



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45059) Add to_reflect to Scala and Python

2023-09-01 Thread Jia Fan (Jira)
Jia Fan created SPARK-45059:
---

 Summary: Add to_reflect to Scala and Python
 Key: SPARK-45059
 URL: https://issues.apache.org/jira/browse/SPARK-45059
 Project: Spark
  Issue Type: Improvement
  Components: Connect, PySpark, SQL
Affects Versions: 4.0.0
Reporter: Jia Fan


Add `try_reflect` to Spark Connect and PySpark
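
A minimal sketch of the expected usage, assuming `try_reflect` keeps the 
argument convention of the existing `reflect` SQL function (class name, method 
name, arguments) and returns NULL instead of failing; the Scala API in the 
comment is hypothetical until this ticket lands:
{code:java}
spark.sql("SELECT try_reflect('java.util.UUID', 'fromString', 'not-a-uuid')").show()
// Hypothetical Scala API mirroring `reflect`:
// df.select(try_reflect(lit("java.util.UUID"), lit("fromString"), col("s")))
{code}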



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44735) Log a warning when inserting columns with the same name by row that don't match up

2023-08-31 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760971#comment-17760971
 ] 

Jia Fan commented on SPARK-44735:
-

It's a good improvement, let me implement it.
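
For context, a hedged illustration of the scenario described in the quoted issue 
below (the table names are hypothetical): both tables use the same column names, 
but a positional INSERT silently pairs them in the wrong order, which is where 
the proposed warning would help.
{code:java}
spark.sql("CREATE TABLE tgt (first_name STRING, last_name STRING) USING parquet")
spark.sql("CREATE TABLE src (last_name STRING, first_name STRING) USING parquet")
spark.sql("INSERT INTO src VALUES ('Doe', 'Jane')")

spark.sql("INSERT INTO tgt SELECT * FROM src")          // by position: silently swaps the values
spark.sql("INSERT INTO tgt BY NAME SELECT * FROM src")  // by name: pairs columns as intended
{code}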

> Log a warning when inserting columns with the same name by row that don't 
> match up
> --
>
> Key: SPARK-44735
> URL: https://issues.apache.org/jira/browse/SPARK-44735
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.2, 3.5.0, 4.0.0
>Reporter: Holden Karau
>Priority: Minor
>
> With SPARK-42750 people can now insert by name, but sometimes people forget 
> it. We should log a warning when it *looks like* someone forgot it (e.g. an 
> insert by column position where all the column names match *but* do not line 
> up positionally).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44988) Parquet INT64 (TIMESTAMP(NANOS,false)) throwing Illegal Parquet type

2023-08-31 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760931#comment-17760931
 ] 

Jia Fan commented on SPARK-44988:
-

Have you tried setting spark.sql.legacy.parquet.nanosAsLong to true?
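
For reference, a minimal sketch of enabling that flag when the session is 
created (the app name and path are illustrative):
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("nanos-as-long")
  .config("spark.sql.legacy.parquet.nanosAsLong", "true")
  .getOrCreate()

// With the flag enabled, nanosecond timestamps are expected to be read as LongType.
val df = spark.read.parquet("/path/to/file_with_nanos.parquet")
{code}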

> Parquet INT64 (TIMESTAMP(NANOS,false)) throwing Illegal Parquet type
> 
>
> Key: SPARK-44988
> URL: https://issues.apache.org/jira/browse/SPARK-44988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.4.1
>Reporter: Flavio Odas
>Priority: Critical
>
> This bug seems similar to https://issues.apache.org/jira/browse/SPARK-40819, 
> except that it's a problem with INT64 (TIMESTAMP(NANOS,false)), instead of 
> INT64 (TIMESTAMP(NANOS,true)).
> The error happens whenever I'm trying to read:
> {code:java}
> org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 
> (TIMESTAMP(NANOS,false)).
>   at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1762)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:206)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertPrimitiveField$2(ParquetSchemaConverter.scala:283)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:224)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertField(ParquetSchemaConverter.scala:187)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertInternal$3(ParquetSchemaConverter.scala:147)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertInternal$3$adapted(ParquetSchemaConverter.scala:117)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>   at scala.collection.immutable.Range.foreach(Range.scala:158)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertInternal(ParquetSchemaConverter.scala:117)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:87)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readSchemaFromFooter$2(ParquetFileFormat.scala:493)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readSchemaFromFooter(ParquetFileFormat.scala:493)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$2(ParquetFileFormat.scala:473)
>   at scala.collection.immutable.Stream.map(Stream.scala:418)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1(ParquetFileFormat.scala:473)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1$adapted(ParquetFileFormat.scala:464)
>   at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$2(SchemaMergeUtils.scala:79)
>   at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:853)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:853)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>   at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>   at org.apache.spark.scheduler.Task.run(Task.scala:139)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44978) Fix SQLQueryTestSuite unable to create table normally

2023-08-27 Thread Jia Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jia Fan updated SPARK-44978:

Description: 
When we repeatedly execute SQLQueryTestSuite to generate the golden files, the 
warehouse files from the previous run are not cleaned up (for example, if the 
test was killed before finishing), resulting in incorrect results.

!image-2023-08-27-14-25-21-843.png!

  was:When we repeatedly execute SQLQueryTestSuite to generate the golden files, 
the warehouse files from the previous run are not cleaned up (for example, if the 
test was killed before finishing), resulting in incorrect results. !image-2023-08-27-14-22-43-361.png!


> Fix SQLQueryTestSuite unable to create table normally
> --
>
> Key: SPARK-44978
> URL: https://issues.apache.org/jira/browse/SPARK-44978
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Jia Fan
>Priority: Major
> Attachments: image-2023-08-27-14-25-21-843.png
>
>
> When we repeatedly execute SQLQueryTestSuite to generate the golden files, the 
> warehouse files from the previous run are not cleaned up (for example, if the 
> test was killed before finishing), resulting in incorrect results. 
> !image-2023-08-27-14-25-21-843.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44978) Fix SQLQueryTestSuite unable to create table normally

2023-08-27 Thread Jia Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jia Fan updated SPARK-44978:

Attachment: image-2023-08-27-14-25-21-843.png

> Fix SQLQueryTestSuite unable to create table normally
> --
>
> Key: SPARK-44978
> URL: https://issues.apache.org/jira/browse/SPARK-44978
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Jia Fan
>Priority: Major
> Attachments: image-2023-08-27-14-25-21-843.png
>
>
> When we repeatedly execute SQLQueryTestSuite to generate the golden files, the 
> warehouse files from the previous run are not cleaned up (for example, if the 
> test was killed before finishing), resulting in incorrect results. !image-2023-08-27-14-22-43-361.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44978) Fix SQLQueryTestSuite unable to create table normally

2023-08-27 Thread Jia Fan (Jira)
Jia Fan created SPARK-44978:
---

 Summary: Fix SQLQueryTestSuite unable to create table normally
 Key: SPARK-44978
 URL: https://issues.apache.org/jira/browse/SPARK-44978
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Jia Fan


When we repeatedly execute SQLQueryTestSuite to generate the golden files, the 
warehouse files from the previous run are not cleaned up (for example, if the 
test was killed before finishing), resulting in incorrect results. !image-2023-08-27-14-22-43-361.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44975) BinaryArithmetic with useless override resolved field

2023-08-25 Thread Jia Fan (Jira)
Jia Fan created SPARK-44975:
---

 Summary: BinaryArithmetic with useless override resolved field
 Key: SPARK-44975
 URL: https://issues.apache.org/jira/browse/SPARK-44975
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Jia Fan


BinaryArithmetic overrides the resolved field unnecessarily; we should remove the override.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44869) Add doc for insert by name statement

2023-08-18 Thread Jia Fan (Jira)
Jia Fan created SPARK-44869:
---

 Summary: Add doc for insert by name statement
 Key: SPARK-44869
 URL: https://issues.apache.org/jira/browse/SPARK-44869
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.5.0
Reporter: Jia Fan


We should add documentation for the INSERT INTO ... BY NAME statement, which is 
supported since https://issues.apache.org/jira/browse/SPARK-42750
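
A minimal sketch of the statement to be documented (the table is hypothetical):
{code:java}
spark.sql("CREATE TABLE people (name STRING, age INT) USING parquet")

// BY NAME matches the SELECT output columns to the target table's columns by
// name rather than by position.
spark.sql("INSERT INTO people BY NAME SELECT 42 AS age, 'Alice' AS name")
{code}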



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42132) DeduplicateRelations rule breaks plan when co-grouping the same DataFrame

2023-08-10 Thread Jia Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jia Fan resolved SPARK-42132.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

> DeduplicateRelations rule breaks plan when co-grouping the same DataFrame
> -
>
> Key: SPARK-42132
> URL: https://issues.apache.org/jira/browse/SPARK-42132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.3.1, 3.2.3, 3.4.0, 3.5.0
>Reporter: Enrico Minack
>Priority: Major
>  Labels: correctness
> Fix For: 3.5.0
>
>
> Co-grouping two DataFrames that share references breaks on the 
> DeduplicateRelations rule:
> {code:java}
> val df = spark.range(3)
> val left_grouped_df = df.groupBy("id").as[Long, Long]
> val right_grouped_df = df.groupBy("id").as[Long, Long]
> val cogroup_df = left_grouped_df.cogroup(right_grouped_df) {
>   case (key, left, right) => left
> }
> cogroup_df.explain()
> {code}
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- SerializeFromObject [input[0, bigint, false] AS value#12L]
>+- CoGroup, id#0: bigint, id#0: bigint, id#0: bigint, [id#13L], [id#13L], 
> [id#13L], [id#13L], obj#11: bigint
>   :- !Sort [id#13L ASC NULLS FIRST], false, 0
>   :  +- !Exchange hashpartitioning(id#13L, 200), ENSURE_REQUIREMENTS, 
> [plan_id=16]
>   : +- Range (0, 3, step=1, splits=16)
>   +- Sort [id#13L ASC NULLS FIRST], false, 0
>  +- Exchange hashpartitioning(id#13L, 200), ENSURE_REQUIREMENTS, 
> [plan_id=17]
> +- Range (0, 3, step=1, splits=16)
> {code}
> The DataFrame cannot be computed:
> {code:java}
> cogroup_df.show()
> {code}
> {code:java}
> java.lang.IllegalStateException: Couldn't find id#13L in [id#0L]
> {code}
> The rule replaces `id#0L` on the right side with `id#13L` while replacing all 
> occurrences in `CoGroup`. Some occurrences of `id#0L` in `CoGroup` refer to 
> the left side and should not be replaced. Further, `id#0L` of the right 
> deserializer is not replaced.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44754) Improve DeduplicateRelations rewriteAttrs compatibility

2023-08-10 Thread Jia Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jia Fan updated SPARK-44754:

Description: 
Following [https://github.com/apache/spark/pull/41554], we should add tests for 
{{MapPartitionsInR}}, {{MapPartitionsInRWithArrow}}, {{MapElements}}, 
{{MapGroups}}, {{FlatMapGroupsWithState}}, {{FlatMapGroupsInR}}, 
{{FlatMapGroupsInRWithArrow}}, {{FlatMapGroupsInPandas}}, {{MapInPandas}}, 
{{PythonMapInArrow}}, {{FlatMapGroupsInPandasWithState}} and 
{{FlatMapCoGroupsInPandas}} to make sure DeduplicateRelations rewriteAttrs 
rewrites their attributes correctly. We should also fix the incorrect behavior 
following [https://github.com/apache/spark/pull/41554].

  was:
Following [https://github.com/apache/spark/pull/41554], we should add tests for 
{{MapPartitionsInR}}, {{MapPartitionsInRWithArrow}}, {{MapElements}}, 
{{MapGroups}}, {{FlatMapGroupsWithState}}, {{FlatMapGroupsInR}}, 
{{FlatMapGroupsInRWithArrow}}, {{FlatMapGroupsInPandas}}, {{MapInPandas}}, 
{{PythonMapInArrow}}, {{FlatMapGroupsInPandasWithState}} and 
{{FlatMapCoGroupsInPandas}} to make sure DeduplicateRelations rewriteAttrs 
rewrites their attributes correctly. We should also fix the incorrect behavior 
following [https://github.com/apache/spark/pull/41554].


> Improve DeduplicateRelations rewriteAttrs compatibility
> ---
>
> Key: SPARK-44754
> URL: https://issues.apache.org/jira/browse/SPARK-44754
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Jia Fan
>Priority: Major
>
> Following [https://github.com/apache/spark/pull/41554], we should add tests 
> for {{MapPartitionsInR}}, {{MapPartitionsInRWithArrow}}, {{MapElements}}, 
> {{MapGroups}}, {{FlatMapGroupsWithState}}, {{FlatMapGroupsInR}}, 
> {{FlatMapGroupsInRWithArrow}}, {{FlatMapGroupsInPandas}}, {{MapInPandas}}, 
> {{PythonMapInArrow}}, {{FlatMapGroupsInPandasWithState}} and 
> {{FlatMapCoGroupsInPandas}} to make sure DeduplicateRelations rewriteAttrs 
> rewrites their attributes correctly. We should also fix the incorrect 
> behavior following [https://github.com/apache/spark/pull/41554].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44754) Improve DeduplicateRelations rewriteAttrs compatibility

2023-08-10 Thread Jia Fan (Jira)
Jia Fan created SPARK-44754:
---

 Summary: Improve DeduplicateRelations rewriteAttrs compatibility
 Key: SPARK-44754
 URL: https://issues.apache.org/jira/browse/SPARK-44754
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Jia Fan


Following [https://github.com/apache/spark/pull/41554], we should add tests for 
{{MapPartitionsInR}}, {{MapPartitionsInRWithArrow}}, {{MapElements}}, 
{{MapGroups}}, {{FlatMapGroupsWithState}}, {{FlatMapGroupsInR}}, 
{{FlatMapGroupsInRWithArrow}}, {{FlatMapGroupsInPandas}}, {{MapInPandas}}, 
{{PythonMapInArrow}}, {{FlatMapGroupsInPandasWithState}} and 
{{FlatMapCoGroupsInPandas}} to make sure DeduplicateRelations rewriteAttrs 
rewrites their attributes correctly. We should also fix the incorrect behavior 
following [https://github.com/apache/spark/pull/41554].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44685) Remove deprecated Catalog#createExternalTable

2023-08-05 Thread Jia Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jia Fan updated SPARK-44685:

Docs Text: The deprecated methods `createExternalTable` have been removed. 
Use `createTable` instead.
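
A minimal migration sketch (the table name, path, and source are illustrative):
{code:java}
// before (removed):
//   spark.catalog.createExternalTable("events", "/data/events", "parquet")
// after:
spark.catalog.createTable("events", "/data/events", "parquet")
{code}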

> Remove deprecated Catalog#createExternalTable
> -
>
> Key: SPARK-44685
> URL: https://issues.apache.org/jira/browse/SPARK-44685
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jia Fan
>Priority: Major
>  Labels: release-notes
>
> We should remove Catalog#createExternalTable because it has been deprecated since 2.2.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44685) Remove deprecated Catalog#createExternalTable

2023-08-05 Thread Jia Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jia Fan updated SPARK-44685:

Labels: release-notes  (was: )

> Remove deprecated Catalog#createExternalTable
> -
>
> Key: SPARK-44685
> URL: https://issues.apache.org/jira/browse/SPARK-44685
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jia Fan
>Priority: Major
>  Labels: release-notes
>
> We should remove Catalog#createExternalTable because it has been deprecated since 2.2.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44685) Remove deprecated Catalog#createExternalTable

2023-08-05 Thread Jia Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jia Fan updated SPARK-44685:

Affects Version/s: 4.0.0
   (was: 3.5.0)

> Remove deprecated Catalog#createExternalTable
> -
>
> Key: SPARK-44685
> URL: https://issues.apache.org/jira/browse/SPARK-44685
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jia Fan
>Priority: Major
>
> We should remove Catalog#createExternalTable because it has been deprecated since 2.2.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44685) Remove deprecated Catalog#createExternalTable

2023-08-04 Thread Jia Fan (Jira)
Jia Fan created SPARK-44685:
---

 Summary: Remove deprecated Catalog#createExternalTable
 Key: SPARK-44685
 URL: https://issues.apache.org/jira/browse/SPARK-44685
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Jia Fan


We should remove Catalog#createExternalTable because it has been deprecated since 2.2.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-44668) ObjectMapper is threadsafe, we can reuse it in Object

2023-08-03 Thread Jia Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jia Fan closed SPARK-44668.
---

> ObjectMapper is threadsafe, we can reuse it in Object
> --
>
> Key: SPARK-44668
> URL: https://issues.apache.org/jira/browse/SPARK-44668
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Jia Fan
>Priority: Major
>
> ObjectMapper is thread-safe, so we can reuse it in an object. But we currently 
> create it in a trait, which means each instance creates its own ObjectMapper.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44668) ObjectMapper is threadsafe, we can reuse it in Object

2023-08-03 Thread Jia Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jia Fan resolved SPARK-44668.
-
Resolution: Invalid

> ObjectMapper is threadsafe, we can reuse it in Object
> --
>
> Key: SPARK-44668
> URL: https://issues.apache.org/jira/browse/SPARK-44668
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Jia Fan
>Priority: Major
>
> ObjectMapper is thread-safe, so we can reuse it in an object. But we currently 
> create it in a trait, which means each instance creates its own ObjectMapper.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44668) ObjectMapper is threadsafe, we can reuse it in Object

2023-08-03 Thread Jia Fan (Jira)
Jia Fan created SPARK-44668:
---

 Summary: ObjectMapper is threadsafe, we can reuse it in Object
 Key: SPARK-44668
 URL: https://issues.apache.org/jira/browse/SPARK-44668
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.1
Reporter: Jia Fan


ObjectMapper is thread-safe, so we can reuse it in an object. But we currently 
create it in a trait, which means each instance creates its own ObjectMapper.
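
A minimal sketch of the intended change (JsonSupport is an illustrative name, 
not the actual Spark trait):
{code:java}
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// before: every class mixing in the trait builds its own mapper
trait JsonSupport {
  protected val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
}

// after: ObjectMapper is thread-safe once configured, so a single instance
// held in an object can be shared
object JsonSupport {
  val mapper: ObjectMapper = new ObjectMapper().registerModule(DefaultScalaModule)
}
{code}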



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41636) DataSourceStrategy#selectFilters returns predicates in non-deterministic order

2023-08-02 Thread Jia Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jia Fan updated SPARK-41636:

Affects Version/s: 3.4.1
   3.4.0

> DataSourceStrategy#selectFilters returns predicates in non-deterministic order
> --
>
> Key: SPARK-41636
> URL: https://issues.apache.org/jira/browse/SPARK-41636
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.4.0, 3.4.1
>Reporter: Jonny Serencsa
>Priority: Major
>
> Method 
> org.apache.spark.sql.execution.datasources.DataSourceStrategy#selectFilters, 
> which is used to determine "pushdown-able" filters, does not preserve the 
> order of the input {{Seq[Expression]}} nor does it return the same order 
> across the same plans (modulo ExprId differences). This is resulting in 
> CodeGenerator cache misses even when the exact same LogicalPlan is executed. 
> The aforementioned method does not attempt to maintain the order of the input 
> predicates, though it happens to do so when there are fewer than 5 
> pushdown-able {{Expression}}s in the input (due to some "small maps" logic in 
> {{scala.collection.TraversableOnce#toMap}}). 
> Returning in the same order as the input will reduce churn on the 
> CodeGenerator cache under prolonged workloads that execute queries that are 
> very similar. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44577) INSERT BY NAME returns non-sensical error message

2023-07-28 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17748772#comment-17748772
 ] 

Jia Fan commented on SPARK-44577:
-

https://github.com/apache/spark/pull/42220

> INSERT BY NAME returns non-sensical error message
> -
>
> Key: SPARK-44577
> URL: https://issues.apache.org/jira/browse/SPARK-44577
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Major
>
> CREATE TABLE bug(c1 INT);
> INSERT INTO bug BY NAME SELECT 1 AS c2;
> ==> Multi-part identifier cannot be empty.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44577) INSERT BY NAME returns non-sensical error message

2023-07-28 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17748769#comment-17748769
 ] 

Jia Fan commented on SPARK-44577:
-

Let me fix this, thanks for the report.

> INSERT BY NAME returns non-sensical error message
> -
>
> Key: SPARK-44577
> URL: https://issues.apache.org/jira/browse/SPARK-44577
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Major
>
> CREATE TABLE bug(c1 INT);
> INSERT INTO bug BY NAME SELECT 1 AS c2;
> ==> Multi-part identifier cannot be empty.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42118) Wrong result when parsing a multiline JSON file with differing types for same column

2023-07-20 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17745345#comment-17745345
 ] 

Jia Fan commented on SPARK-42118:
-

Seems like this is already fixed on the master branch.

> Wrong result when parsing a multiline JSON file with differing types for same 
> column
> 
>
> Key: SPARK-42118
> URL: https://issues.apache.org/jira/browse/SPARK-42118
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Dilip Biswal
>Priority: Major
>
> Here is a simple reproduction of the problem. We have a JSON file whose 
> content looks like following and is in multiLine format.
> {code}
> [{"name":""},{"name":123.34}]
> {code}
> Here is the result of spark query when we read the above content.
> scala> val df = spark.read.format("json").option("multiLine", 
> true).load("/tmp/json")
> df: org.apache.spark.sql.DataFrame = [name: double]
> scala> df.show(false)
> ++
> |name|
> ++
> |null|
> ++
> scala> df.count
> res5: Long = 2
> This is quite a serious problem for us as it's causing us to master corrupt 
> data in lake. If there is some issue with parsing the input, we expect spark 
> set the "_corrupt_record" so that we can act on it. Please note that df.count 
> is reporting 2 rows where as df.show only reports 1 row with null value.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43778) RewriteCorrelatedScalarSubquery should handle duplicate attributes

2023-07-19 Thread Jia Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jia Fan resolved SPARK-43778.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

> RewriteCorrelatedScalarSubquery should handle duplicate attributes
> --
>
> Key: SPARK-43778
> URL: https://issues.apache.org/jira/browse/SPARK-43778
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Andrey Gubichev
>Priority: Major
> Fix For: 4.0.0
>
>
> This is a correctness problem caused by the fact that the decorrelation rule 
> does not dedup join attributes properly. This leads to the join on (c1 = c1), 
> which is simplified to True and the join becomes a cross product.
>  
> Example query:
>  
> {code:java}
> create view t(c1, c2) as values (0, 1), (0, 2), (1, 2)
> select c1, c2, (select count(*) cnt from t t2 where t1.c1 = t2.c1 having cnt 
> = 0) from t t1
> -- Correct answer: [(0, 1, null), (0, 2, null), (1, 2, null)]
> +---+---+--+
> |c1 |c2 |scalarsubquery(c1)|
> +---+---+--+
> |0  |1  |null  |
> |0  |1  |null  |
> |0  |2  |null  |
> |0  |2  |null  |
> |1  |2  |null  |
> |1  |2  |null  |
> +---+---+--+ {code}
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-43778) RewriteCorrelatedScalarSubquery should handle duplicate attributes

2023-07-19 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744875#comment-17744875
 ] 

Jia Fan edited comment on SPARK-43778 at 7/20/23 5:07 AM:
--

This ticket was already fixed by https://issues.apache.org/jira/browse/SPARK-43838. 
Should we backport the fix? [~cloud_fan] 


was (Author: fanjia):
This ticket was already fixed by https://issues.apache.org/jira/browse/SPARK-43838.

> RewriteCorrelatedScalarSubquery should handle duplicate attributes
> --
>
> Key: SPARK-43778
> URL: https://issues.apache.org/jira/browse/SPARK-43778
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Andrey Gubichev
>Priority: Major
>
> This is a correctness problem caused by the fact that the decorrelation rule 
> does not dedup join attributes properly. This leads to the join on (c1 = c1), 
> which is simplified to True and the join becomes a cross product.
>  
> Example query:
>  
> {code:java}
> create view t(c1, c2) as values (0, 1), (0, 2), (1, 2)
> select c1, c2, (select count(*) cnt from t t2 where t1.c1 = t2.c1 having cnt 
> = 0) from t t1
> -- Correct answer: [(0, 1, null), (0, 2, null), (1, 2, null)]
> +---+---+--+
> |c1 |c2 |scalarsubquery(c1)|
> +---+---+--+
> |0  |1  |null  |
> |0  |1  |null  |
> |0  |2  |null  |
> |0  |2  |null  |
> |1  |2  |null  |
> |1  |2  |null  |
> +---+---+--+ {code}
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43778) RewriteCorrelatedScalarSubquery should handle duplicate attributes

2023-07-19 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744875#comment-17744875
 ] 

Jia Fan commented on SPARK-43778:
-

This ticket was already fixed by https://issues.apache.org/jira/browse/SPARK-43838.

> RewriteCorrelatedScalarSubquery should handle duplicate attributes
> --
>
> Key: SPARK-43778
> URL: https://issues.apache.org/jira/browse/SPARK-43778
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Andrey Gubichev
>Priority: Major
>
> This is a correctness problem caused by the fact that the decorrelation rule 
> does not dedup join attributes properly. This leads to the join on (c1 = c1), 
> which is simplified to True and the join becomes a cross product.
>  
> Example query:
>  
> {code:java}
> create view t(c1, c2) as values (0, 1), (0, 2), (1, 2)
> select c1, c2, (select count(*) cnt from t t2 where t1.c1 = t2.c1 having cnt 
> = 0) from t t1
> -- Correct answer: [(0, 1, null), (0, 2, null), (1, 2, null)]
> +---+---+--+
> |c1 |c2 |scalarsubquery(c1)|
> +---+---+--+
> |0  |1  |null  |
> |0  |1  |null  |
> |0  |2  |null  |
> |0  |2  |null  |
> |1  |2  |null  |
> |1  |2  |null  |
> +---+---+--+ {code}
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44487) KubernetesSuite reports NPE when spark.kubernetes.test.unpackSparkDir is not set

2023-07-19 Thread Jia Fan (Jira)
Jia Fan created SPARK-44487:
---

 Summary: KubernetesSuite reports NPE when 
spark.kubernetes.test.unpackSparkDir is not set
 Key: SPARK-44487
 URL: https://issues.apache.org/jira/browse/SPARK-44487
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes, Tests
Affects Versions: 3.4.1
Reporter: Jia Fan


KubernetesSuite reports an NPE when spark.kubernetes.test.unpackSparkDir is not set.

 

Exception encountered when invoking run on a nested suite.
java.lang.NullPointerException
    at sun.nio.fs.UnixPath.normalizeAndCheck(UnixPath.java:77)
    at sun.nio.fs.UnixPath.<init>(UnixPath.java:71)
    at sun.nio.fs.UnixFileSystem.getPath(UnixFileSystem.java:281)
    at java.nio.file.Paths.get(Paths.java:84)
    at 
org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.$anonfun$beforeAll$4(KubernetesSuite.scala:164)
    at 
org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.$anonfun$beforeAll$4$adapted(KubernetesSuite.scala:163)
    at scala.collection.LinearSeqOptimized.find(LinearSeqOptimized.scala:115)
    at scala.collection.LinearSeqOptimized.find$(LinearSeqOptimized.scala:112)
    at scala.collection.immutable.List.find(List.scala:91)
    at 
org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.beforeAll(KubernetesSuite.scala:163)
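
A hedged sketch of the kind of guard that would avoid the NPE (the actual fix 
may look different): only build the Path when the system property is present.
{code:java}
import java.nio.file.{Files, Path, Paths}

val unpackDir: Option[Path] =
  Option(System.getProperty("spark.kubernetes.test.unpackSparkDir"))
    .filter(_.nonEmpty)
    .map(Paths.get(_))
    .filter(Files.isDirectory(_))
{code}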



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44428) Add test case for all PartitionEvaluator API

2023-07-17 Thread Jia Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jia Fan resolved SPARK-44428.
-
Resolution: Not A Problem

> Add test case for all PartitionEvaluator API
> 
>
> Key: SPARK-44428
> URL: https://issues.apache.org/jira/browse/SPARK-44428
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Jia Fan
>Priority: Major
>
> We need to add PartitionEvaluator API enabled use cases to existing SQL tests 
> to ensure that all PartitionEvaluator API changes work correctly



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44443) Use PartitionEvaluator API in CoGroupExec, DeserializeToObjectExec, ExternalRDDScanExec

2023-07-15 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17743353#comment-17743353
 ] 

Jia Fan commented on SPARK-44443:
-

I'm working on it.

> Use PartitionEvaluator API in CoGroupExec, DeserializeToObjectExec, 
> ExternalRDDScanExec
> ---
>
> Key: SPARK-44443
> URL: https://issues.apache.org/jira/browse/SPARK-44443
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Jia Fan
>Priority: Major
>
> Use PartitionEvaluator API in CoGroupExec, DeserializeToObjectExec, 
> ExternalRDDScanExec



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44443) Use PartitionEvaluator API in CoGroupExec, DeserializeToObjectExec, ExternalRDDScanExec

2023-07-15 Thread Jia Fan (Jira)
Jia Fan created SPARK-44443:
---

 Summary: Use PartitionEvaluator API in CoGroupExec, 
DeserializeToObjectExec, ExternalRDDScanExec
 Key: SPARK-44443
 URL: https://issues.apache.org/jira/browse/SPARK-44443
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: Jia Fan


Use PartitionEvaluator API in CoGroupExec, DeserializeToObjectExec, 
ExternalRDDScanExec



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44428) Add test case for all PartitionEvaluator API

2023-07-14 Thread Jia Fan (Jira)
Jia Fan created SPARK-44428:
---

 Summary: Add test case for all PartitionEvaluator API
 Key: SPARK-44428
 URL: https://issues.apache.org/jira/browse/SPARK-44428
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Jia Fan


We need to add PartitionEvaluator API enabled use cases to existing SQL tests 
to ensure that all PartitionEvaluator API changes work correctly



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44427) Use PartitionEvaluator API in MapElementsExec, MapGroupsExec, MapPartitionsExec

2023-07-14 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17743119#comment-17743119
 ] 

Jia Fan commented on SPARK-44427:
-

I'm working on this.

> Use PartitionEvaluator API in MapElementsExec, MapGroupsExec, 
> MapPartitionsExec
> ---
>
> Key: SPARK-44427
> URL: https://issues.apache.org/jira/browse/SPARK-44427
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Jia Fan
>Priority: Major
>
> Use PartitionEvaluator API in MapElementsExec, MapGroupsExec, 
> MapPartitionsExec



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44427) Use PartitionEvaluator API in MapElementsExec, MapGroupsExec, MapPartitionsExec

2023-07-14 Thread Jia Fan (Jira)
Jia Fan created SPARK-44427:
---

 Summary: Use PartitionEvaluator API in MapElementsExec, 
MapGroupsExec, MapPartitionsExec
 Key: SPARK-44427
 URL: https://issues.apache.org/jira/browse/SPARK-44427
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: Jia Fan


Use PartitionEvaluator API in MapElementsExec, MapGroupsExec, MapPartitionsExec
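
For context, a hedged sketch of the PartitionEvaluator API these sub-tasks 
migrate to (simplified; the real SparkPlan integration involves more plumbing, 
and the evaluator below is illustrative):
{code:java}
import org.apache.spark.{PartitionEvaluator, PartitionEvaluatorFactory}

class UpperCaseEvaluatorFactory extends PartitionEvaluatorFactory[String, String] {
  override def createEvaluator(): PartitionEvaluator[String, String] =
    new PartitionEvaluator[String, String] {
      // Each evaluator consumes one partition's rows as an iterator.
      override def eval(partitionIndex: Int, inputs: Iterator[String]*): Iterator[String] =
        inputs.head.map(_.toUpperCase)
    }
}

// Usage (assumed): rdd.mapPartitionsWithEvaluator(new UpperCaseEvaluatorFactory)
{code}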



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44162) Support G1GC in `spark.eventLog.gcMetrics.*` without warning

2023-07-14 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17743112#comment-17743112
 ] 

Jia Fan commented on SPARK-44162:
-

https://github.com/apache/spark/pull/41808

> Support G1GC in `spark.eventLog.gcMetrics.*` without warning
> 
>
> Key: SPARK-44162
> URL: https://issues.apache.org/jira/browse/SPARK-44162
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> {code}
> >>> 23/06/23 14:26:53 WARN GarbageCollectionMetrics: To enable non-built-in 
> >>> garbage collector(s) List(G1 Concurrent GC), users should configure 
> >>> it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or 
> >>> spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44385) Use PartitionEvaluator API in MergingSessionsExec & UpdatingSessionsExec

2023-07-12 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17742402#comment-17742402
 ] 

Jia Fan commented on SPARK-44385:
-

I'm working on it.

> Use PartitionEvaluator API in MergingSessionsExec & UpdatingSessionsExec
> 
>
> Key: SPARK-44385
> URL: https://issues.apache.org/jira/browse/SPARK-44385
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Jia Fan
>Priority: Major
>
> Use PartitionEvaluator API in MergingSessionsExec & UpdatingSessionsExec



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44386) Use PartitionEvaluator API in HashAggregateExec, ObjectHashAggregateExec, SortAggregateExec

2023-07-12 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17742403#comment-17742403
 ] 

Jia Fan commented on SPARK-44386:
-

I'm working on it.

> Use PartitionEvaluator API in HashAggregateExec, ObjectHashAggregateExec, 
> SortAggregateExec
> ---
>
> Key: SPARK-44386
> URL: https://issues.apache.org/jira/browse/SPARK-44386
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Jia Fan
>Priority: Major
>
> Use PartitionEvaluator API in HashAggregateExec, ObjectHashAggregateExec, 
> SortAggregateExec



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44386) Use PartitionEvaluator API in HashAggregateExec, ObjectHashAggregateExec, SortAggregateExec

2023-07-12 Thread Jia Fan (Jira)
Jia Fan created SPARK-44386:
---

 Summary: Use PartitionEvaluator API in HashAggregateExec, 
ObjectHashAggregateExec, SortAggregateExec
 Key: SPARK-44386
 URL: https://issues.apache.org/jira/browse/SPARK-44386
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: Jia Fan


Use PartitionEvaluator API in HashAggregateExec, ObjectHashAggregateExec, 
SortAggregateExec



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44385) Use PartitionEvaluator API in MergingSessionsExec & UpdatingSessionsExec

2023-07-12 Thread Jia Fan (Jira)
Jia Fan created SPARK-44385:
---

 Summary: Use PartitionEvaluator API in MergingSessionsExec & 
UpdatingSessionsExec
 Key: SPARK-44385
 URL: https://issues.apache.org/jira/browse/SPARK-44385
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: Jia Fan


Use PartitionEvaluator API in MergingSessionsExec & UpdatingSessionsExec



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44375) Use PartitionEvaluator API in DebugExec

2023-07-11 Thread Jia Fan (Jira)
Jia Fan created SPARK-44375:
---

 Summary: Use PartitionEvaluator API in DebugExec
 Key: SPARK-44375
 URL: https://issues.apache.org/jira/browse/SPARK-44375
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: Jia Fan


Use PartitionEvaluator API in DebugExec



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44375) Use PartitionEvaluator API in DebugExec

2023-07-11 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17742004#comment-17742004
 ] 

Jia Fan commented on SPARK-44375:
-

I'm working on it.

> Use PartitionEvaluator API in DebugExec
> ---
>
> Key: SPARK-44375
> URL: https://issues.apache.org/jira/browse/SPARK-44375
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Jia Fan
>Priority: Major
>
> Use PartitionEvaluator API in DebugExec



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44370) Migrate Buf remote generation alpha to remote plugins

2023-07-11 Thread Jia Fan (Jira)
Jia Fan created SPARK-44370:
---

 Summary: Migrate Buf remote generation alpha to remote plugins
 Key: SPARK-44370
 URL: https://issues.apache.org/jira/browse/SPARK-44370
 Project: Spark
  Issue Type: Bug
  Components: Connect
Affects Versions: 3.4.1
Reporter: Jia Fan


Buf no longer supports remote generation alpha. Please refer to 
[https://buf.build/docs/migration-guides/migrate-remote-generation-alpha/] . We 
should migrate from Buf remote generation alpha to remote plugins by following 
the guide.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44262) JdbcUtils hardcodes some SQL statements

2023-07-03 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17739515#comment-17739515
 ] 

Jia Fan commented on SPARK-44262:
-

I got it. I will try to create a PR for this. Thanks for your explanation.
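
To make the ask concrete, a purely hypothetical sketch of the kind of dialect hook 
being discussed; the dropTableStatement/insertStatement methods below do not exist 
in the current JdbcDialect API and are only meant to illustrate the proposal:

{code:scala}
// Hypothetical sketch: if JdbcDialect exposed the statements that JdbcUtils
// currently hardcodes, a non-SQL store could override them.
import org.apache.spark.sql.jdbc.JdbcDialect

object Neo4jLikeDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:neo4j")

  // Illustrative hooks only; not part of the current API.
  def dropTableStatement(table: String): String =
    s"MATCH (n:`$table`) DETACH DELETE n"

  def insertStatement(table: String, columns: Seq[String]): String =
    s"CREATE (n:`$table` {" + columns.map(c => s"$c: ?").mkString(", ") + "})"
}
{code}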

> JdbcUtils hardcodes some SQL statements
> ---
>
> Key: SPARK-44262
> URL: https://issues.apache.org/jira/browse/SPARK-44262
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Florent BIVILLE
>Priority: Major
>
> I am currently investigating an integration with the [Neo4j JBDC 
> driver|https://github.com/neo4j-contrib/neo4j-jdbc] and a Spark-based cloud 
> vendor SDK.
>  
> This SDK relies on Spark's {{JdbcUtils}} to run queries and insert data.
> While {{JdbcUtils}} partly delegates to 
> \{{org.apache.spark.sql.jdbc.JdbcDialect}} for some queries, some others are 
> hardcoded to SQL, see:
>  * {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#dropTable}}
>  * 
> {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#getInsertStatement}}
>  
> This works fine for relational databases but breaks for NOSQL stores that do 
> not support SQL translation (like Neo4j).
> Is there a plan to augment the {{JdbcDialect}} surface so that it is also 
> responsible for these currently-hardcoded queries?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44262) JdbcUtils hardcodes some SQL statements

2023-07-03 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17739501#comment-17739501
 ] 

Jia Fan commented on SPARK-44262:
-

Maybe you should check [https://github.com/neo4j-contrib/neo4j-spark-connector]. Or 
https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit

> JdbcUtils hardcodes some SQL statements
> ---
>
> Key: SPARK-44262
> URL: https://issues.apache.org/jira/browse/SPARK-44262
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Florent BIVILLE
>Priority: Major
>
> I am currently investigating an integration with the [Neo4j JBDC 
> driver|https://github.com/neo4j-contrib/neo4j-jdbc] and a Spark-based cloud 
> vendor SDK.
>  
> This SDK relies on Spark's {{JdbcUtils}} to run queries and insert data.
> While {{JdbcUtils}} partly delegates to 
> \{{org.apache.spark.sql.jdbc.JdbcDialect}} for some queries, some others are 
> hardcoded to SQL, see:
>  * {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#dropTable}}
>  * 
> {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#getInsertStatement}}
>  
> This works fine for relational databases but breaks for NOSQL stores that do 
> not support SQL translation (like Neo4j).
> Is there a plan to augment the {{JdbcDialect}} surface so that it is also 
> responsible for these currently-hardcoded queries?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44262) JdbcUtils hardcodes some SQL statements

2023-07-01 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17739339#comment-17739339
 ] 

Jia Fan commented on SPARK-44262:
-

Why not use DataSource V2?

> JdbcUtils hardcodes some SQL statements
> ---
>
> Key: SPARK-44262
> URL: https://issues.apache.org/jira/browse/SPARK-44262
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Florent BIVILLE
>Priority: Major
>
> I am currently investigating an integration with the [Neo4j JBDC 
> driver|https://github.com/neo4j-contrib/neo4j-jdbc] and a Spark-based cloud 
> vendor SDK.
>  
> This SDK relies on Spark's {{JdbcUtils}} to run queries and insert data.
> While {{JdbcUtils}} partly delegates to 
> \{{org.apache.spark.sql.jdbc.JdbcDialect}} for some queries, some others are 
> hardcoded to SQL, see:
>  * {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#dropTable}}
>  * 
> {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#getInsertStatement}}
>  
> This works fine for relational databases but breaks for NOSQL stores that do 
> not support SQL translation (like Neo4j).
> Is there a plan to augment the {{JdbcDialect}} surface so that it is also 
> responsible for these currently-hardcoded queries?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44268) Add tests to ensure error-classes.json and docs are in sync

2023-07-01 Thread Jia Fan (Jira)
Jia Fan created SPARK-44268:
---

 Summary: Add tests to ensure error-classes.json and docs are in 
sync
 Key: SPARK-44268
 URL: https://issues.apache.org/jira/browse/SPARK-44268
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.1
Reporter: Jia Fan


We should add tests to ensure error-classes.json and the docs are in sync, so 
that both are always up to date before a PR is committed.
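
A rough sketch of what such a check could look like; the file paths and doc layout 
below are assumptions, not the actual locations used by the test:

{code:scala}
// Sketch only: parse error-classes.json and assert every error class name
// appears somewhere in the error documentation.
// Assumption: both paths below are relative to the repository root.
import java.nio.file.{Files, Paths}
import scala.jdk.CollectionConverters._
import com.fasterxml.jackson.databind.ObjectMapper

val json = new String(Files.readAllBytes(
  Paths.get("core/src/main/resources/error/error-classes.json")), "UTF-8")
val errorClasses = new ObjectMapper().readTree(json).fieldNames().asScala.toSeq

val docs = new String(Files.readAllBytes(Paths.get("docs/sql-error-conditions.md")), "UTF-8")
val missing = errorClasses.filterNot(c => docs.contains(c))
assert(missing.isEmpty, s"Error classes missing from docs: ${missing.mkString(", ")}")
{code}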



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44236) Even `spark.sql.codegen.factoryMode` is NO_CODEGEN, the WholeStageCodegen also will be generated.

2023-06-28 Thread Jia Fan (Jira)
Jia Fan created SPARK-44236:
---

 Summary: Even `spark.sql.codegen.factoryMode` is NO_CODEGEN, the 
WholeStageCodegen also will be generated.
 Key: SPARK-44236
 URL: https://issues.apache.org/jira/browse/SPARK-44236
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.1
Reporter: Jia Fan


Even when `spark.sql.codegen.factoryMode` is NO_CODEGEN, Spark still generates a 
WholeStageCodegen plan if `spark.sql.codegen.wholeStage` is set to `true`.
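
A small sketch that shows the observation (assuming `spark.sql.codegen.factoryMode` 
is the internal codegen-factory config and `spark.sql.codegen.wholeStage` the usual 
whole-stage flag):

{code:scala}
// Sketch only: force the codegen factory to NO_CODEGEN and check whether the
// physical plan still contains a WholeStageCodegen node.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("codegen-check").getOrCreate()
spark.conf.set("spark.sql.codegen.factoryMode", "NO_CODEGEN")
spark.conf.set("spark.sql.codegen.wholeStage", "true")

val plan = spark.range(10).selectExpr("id + 1").queryExecution.executedPlan
// Per this ticket, the output still shows a WholeStageCodegen node.
println(plan)
{code}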



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43999) Data is still fetched even though result was returned

2023-06-26 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17737421#comment-17737421
 ] 

Jia Fan commented on SPARK-43999:
-

https://github.com/apache/spark/pull/41755

> Data is still fetched even though result was returned
> -
>
> Key: SPARK-43999
> URL: https://issues.apache.org/jira/browse/SPARK-43999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
> Environment: Production
>Reporter: Kamil Kliczbor
>Priority: Major
> Attachments: Profiler.PNG
>
>
> h2. Short problem description:
> I have two tables:
>  * tab1 is empty
>  * tab6 has milions of records
>  * when Spark returns results due to empty database table tab1, it still asks 
> for the tab6 data
> When I create the query that uses LEFT JOIN, the results are returned 
> immediately, however under the hood the huge table is requested to return the 
> results anyway.
> h2. Repro:
> h3. Prepare the MSSQL server database 
> 1. Install SQLExpress (in my case MSSQL2012, but can be any version) as a 
> named instance SQL2012.
> 2. Download and install 
> [SSMS|https://learn.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms?view=sql-server-ver16]
>  (or any other tool) and run the following Query
> {code:sql}
> USE [master]
> GO
> CREATE DATABASE QueueSlots
> GO
> CREATE LOGIN [spark] WITH PASSWORD=N'spark', DEFAULT_DATABASE=[master], 
> CHECK_EXPIRATION=OFF, CHECK_POLICY=OFF
> GO
> USE [QueueSlots]
> GO
> CREATE USER [spark] FOR LOGIN [spark] WITH DEFAULT_SCHEMA=[dbo]
> GO
> {code}
> 3. Then create the tables and fill the tab6 with the data:
> {code:sql}
> CREATE TABLE tab1 (Id INT, Name NVARCHAR(50))
> CREATE TABLE tab6 (Id INT, Name NVARCHAR(50))
> insert into tab6
> select o1.object_id as Id , o1.name as Name
> from sys.objects as o1 
> cross join sys.objects as o2
> cross join sys.objects as o3
> cross join sys.objects as o4
> -- it might be required to increase the numer of the cross joins to increase 
> the number of the rows, approximately 1 mln is enough - select should take 
> several seconds
> {code}
> h3. Prepare Spark
>  # Download mssql jdbc driver in version 12.2.0
>  # Run spark-shell2.cmd with the settings -cp 
> "/lib/sqljdbc/12.2/mssql-jdbc-12.2.0.jre8.jar"
> h3. Create temporary views on Spark
> {code:java}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> sqlContext.sql("""
> CREATE TEMPORARY VIEW tab1
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   driver 'com.microsoft.sqlserver.jdbc.SQLServerDriver',
>   url 
> 'jdbc:sqlserver://;serverName=localhost;instanceName=sql2012;databaseName=QueueSlots;encrypt=true;trustServerCertificate=true;',
>   
>   dbtable 'dbo.Tab1',
>   user 'spark',
>   password 'spark'
> )
> """)
> sqlContext.sql("""
> CREATE TEMPORARY VIEW tab6
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   driver 'com.microsoft.sqlserver.jdbc.SQLServerDriver',
>   url 
> 'jdbc:sqlserver://;serverName=localhost;instanceName=sql2012;databaseName=QueueSlots;encrypt=true;trustServerCertificate=true;',
>   
>   dbtable 'dbo.Tab6',
>   user 'spark',
>   password 'spark'
> )
> """)
> {code}
> h3. Enable SQL Server Profiler tracing
>  # Go to SSMS and open Sql Server Profiler (Tools -> Sql Server Profiler). 
> Create new trace to the "QueueSlots" database. Use filtering options to see 
> only queries issued for that database (Events Selection tab -> check Show all 
> events and Show all columns, then click Column Filters -> DatabaseName like 
> QueueSlots).
>  # Run the trace
> h3. Run the query in Spark console
>  # Run the following query
> {code:java}
> sqlContext.sql("""
> SELECT t1.Id, t1.Name, t6.Name
>   FROM tab1 as t1
>   LEFT OUTER JOIN tab6 AS t6 ON t6.Id = t1.Id
> """).show
> {code}
> The results are returned immediately as:
> {code:java}
> +---+++
> | Id|Name|Name|
> +---+++
> +---+++
> [Stage 63:> (0 + 1) / 
> 1]
> {code}
> h3. {color:#00875a}Expected{color}
> As the results are returned immediately for empty table, another sources are 
> not queried.
> h3. {color:#de350b}Given:{color}
> The table6 is requested to return the data even though it is not being used 
> and it is CPU and IO consuming operation.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-43999) Data is still fetched even though result was returned

2023-06-26 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17737026#comment-17737026
 ] 

Jia Fan edited comment on SPARK-43999 at 6/26/23 10:50 AM:
---

The reason is that when AQE is on, the stage for the small empty table always 
finishes first when joining the two tables. Once the left side's result is 
available, the AQE optimizer uses its statistics to re-optimize the plan, so the 
right table's result becomes unnecessary. The result is returned, but the query 
stage for the right table is currently neither cancelled nor forced to finish.


was (Author: fanjia):
The reason are when AQE on, the small empty table alway return faster when join 
on two table. When get the left result, the AQE optimizer will use stats to 
reOptimize plan, so the right table result will be unnecessary. The result 
return, but the right table query stage will not be cancel or forch finish at 
now. 

> Data is still fetched even though result was returned
> -
>
> Key: SPARK-43999
> URL: https://issues.apache.org/jira/browse/SPARK-43999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
> Environment: Production
>Reporter: Kamil Kliczbor
>Priority: Major
> Attachments: Profiler.PNG
>
>
> h2. Short problem description:
> I have two tables:
>  * tab1 is empty
>  * tab6 has milions of records
>  * when Spark returns results due to empty database table tab1, it still asks 
> for the tab6 data
> When I create the query that uses LEFT JOIN, the results are returned 
> immediately, however under the hood the huge table is requested to return the 
> results anyway.
> h2. Repro:
> h3. Prepare the MSSQL server database 
> 1. Install SQLExpress (in my case MSSQL2012, but can be any version) as a 
> named instance SQL2012.
> 2. Download and install 
> [SSMS|https://learn.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms?view=sql-server-ver16]
>  (or any other tool) and run the following Query
> {code:sql}
> USE [master]
> GO
> CREATE DATABASE QueueSlots
> GO
> CREATE LOGIN [spark] WITH PASSWORD=N'spark', DEFAULT_DATABASE=[master], 
> CHECK_EXPIRATION=OFF, CHECK_POLICY=OFF
> GO
> USE [QueueSlots]
> GO
> CREATE USER [spark] FOR LOGIN [spark] WITH DEFAULT_SCHEMA=[dbo]
> GO
> {code}
> 3. Then create the tables and fill the tab6 with the data:
> {code:sql}
> CREATE TABLE tab1 (Id INT, Name NVARCHAR(50))
> CREATE TABLE tab6 (Id INT, Name NVARCHAR(50))
> insert into tab6
> select o1.object_id as Id , o1.name as Name
> from sys.objects as o1 
> cross join sys.objects as o2
> cross join sys.objects as o3
> cross join sys.objects as o4
> -- it might be required to increase the numer of the cross joins to increase 
> the number of the rows, approximately 1 mln is enough - select should take 
> several seconds
> {code}
> h3. Prepare Spark
>  # Download mssql jdbc driver in version 12.2.0
>  # Run spark-shell2.cmd with the settings -cp 
> "/lib/sqljdbc/12.2/mssql-jdbc-12.2.0.jre8.jar"
> h3. Create temporary views on Spark
> {code:java}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> sqlContext.sql("""
> CREATE TEMPORARY VIEW tab1
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   driver 'com.microsoft.sqlserver.jdbc.SQLServerDriver',
>   url 
> 'jdbc:sqlserver://;serverName=localhost;instanceName=sql2012;databaseName=QueueSlots;encrypt=true;trustServerCertificate=true;',
>   
>   dbtable 'dbo.Tab1',
>   user 'spark',
>   password 'spark'
> )
> """)
> sqlContext.sql("""
> CREATE TEMPORARY VIEW tab6
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   driver 'com.microsoft.sqlserver.jdbc.SQLServerDriver',
>   url 
> 'jdbc:sqlserver://;serverName=localhost;instanceName=sql2012;databaseName=QueueSlots;encrypt=true;trustServerCertificate=true;',
>   
>   dbtable 'dbo.Tab6',
>   user 'spark',
>   password 'spark'
> )
> """)
> {code}
> h3. Enable SQL Server Profiler tracing
>  # Go to SSMS and open Sql Server Profiler (Tools -> Sql Server Profiler). 
> Create new trace to the "QueueSlots" database. Use filtering options to see 
> only queries issued for that database (Events Selection tab -> check Show all 
> events and Show all columns, then click Column Filters -> DatabaseName like 
> QueueSlots).
>  # Run the trace
> h3. Run the query in Spark console
>  # Run the following query
> {code:java}
> sqlContext.sql("""
> SELECT t1.Id, t1.Name, t6.Name
>   FROM tab1 as t1
>   LEFT OUTER JOIN tab6 AS t6 ON t6.Id = t1.Id
> """).show
> {code}
> The results are returned immediately as:
> {code:java}
> +---+++
> | Id|Name|Name|
> +---+++
> +---+++
> [Stage 63:> (0 + 1) / 
> 1]
> {code}
> h3. {color:#00875a}Expected{color}
> As the results are returned immediately for empty table, another sources are 
> not queried.
> 

[jira] [Commented] (SPARK-43999) Data is still fetched even though result was returned

2023-06-26 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17737045#comment-17737045
 ] 

Jia Fan commented on SPARK-43999:
-

> When do you plan to fix this behaviour?

I'm working on it.

> Data is still fetched even though result was returned
> -
>
> Key: SPARK-43999
> URL: https://issues.apache.org/jira/browse/SPARK-43999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
> Environment: Production
>Reporter: Kamil Kliczbor
>Priority: Major
> Attachments: Profiler.PNG
>
>
> h2. Short problem description:
> I have two tables:
>  * tab1 is empty
>  * tab6 has milions of records
>  * when Spark returns results due to empty database table tab1, it still asks 
> for the tab6 data
> When I create the query that uses LEFT JOIN, the results are returned 
> immediately, however under the hood the huge table is requested to return the 
> results anyway.
> h2. Repro:
> h3. Prepare the MSSQL server database 
> 1. Install SQLExpress (in my case MSSQL2012, but can be any version) as a 
> named instance SQL2012.
> 2. Download and install 
> [SSMS|https://learn.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms?view=sql-server-ver16]
>  (or any other tool) and run the following Query
> {code:sql}
> USE [master]
> GO
> CREATE DATABASE QueueSlots
> GO
> CREATE LOGIN [spark] WITH PASSWORD=N'spark', DEFAULT_DATABASE=[master], 
> CHECK_EXPIRATION=OFF, CHECK_POLICY=OFF
> GO
> USE [QueueSlots]
> GO
> CREATE USER [spark] FOR LOGIN [spark] WITH DEFAULT_SCHEMA=[dbo]
> GO
> {code}
> 3. Then create the tables and fill the tab6 with the data:
> {code:sql}
> CREATE TABLE tab1 (Id INT, Name NVARCHAR(50))
> CREATE TABLE tab6 (Id INT, Name NVARCHAR(50))
> insert into tab6
> select o1.object_id as Id , o1.name as Name
> from sys.objects as o1 
> cross join sys.objects as o2
> cross join sys.objects as o3
> cross join sys.objects as o4
> -- it might be required to increase the numer of the cross joins to increase 
> the number of the rows, approximately 1 mln is enough - select should take 
> several seconds
> {code}
> h3. Prepare Spark
>  # Download mssql jdbc driver in version 12.2.0
>  # Run spark-shell2.cmd with the settings -cp 
> "/lib/sqljdbc/12.2/mssql-jdbc-12.2.0.jre8.jar"
> h3. Create temporary views on Spark
> {code:java}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> sqlContext.sql("""
> CREATE TEMPORARY VIEW tab1
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   driver 'com.microsoft.sqlserver.jdbc.SQLServerDriver',
>   url 
> 'jdbc:sqlserver://;serverName=localhost;instanceName=sql2012;databaseName=QueueSlots;encrypt=true;trustServerCertificate=true;',
>   
>   dbtable 'dbo.Tab1',
>   user 'spark',
>   password 'spark'
> )
> """)
> sqlContext.sql("""
> CREATE TEMPORARY VIEW tab6
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   driver 'com.microsoft.sqlserver.jdbc.SQLServerDriver',
>   url 
> 'jdbc:sqlserver://;serverName=localhost;instanceName=sql2012;databaseName=QueueSlots;encrypt=true;trustServerCertificate=true;',
>   
>   dbtable 'dbo.Tab6',
>   user 'spark',
>   password 'spark'
> )
> """)
> {code}
> h3. Enable SQL Server Profiler tracing
>  # Go to SSMS and open Sql Server Profiler (Tools -> Sql Server Profiler). 
> Create new trace to the "QueueSlots" database. Use filtering options to see 
> only queries issued for that database (Events Selection tab -> check Show all 
> events and Show all columns, then click Column Filters -> DatabaseName like 
> QueueSlots).
>  # Run the trace
> h3. Run the query in Spark console
>  # Run the following query
> {code:java}
> sqlContext.sql("""
> SELECT t1.Id, t1.Name, t6.Name
>   FROM tab1 as t1
>   LEFT OUTER JOIN tab6 AS t6 ON t6.Id = t1.Id
> """).show
> {code}
> The results are returned immediately as:
> {code:java}
> +---+++
> | Id|Name|Name|
> +---+++
> +---+++
> [Stage 63:> (0 + 1) / 
> 1]
> {code}
> h3. {color:#00875a}Expected{color}
> As the results are returned immediately for empty table, another sources are 
> not queried.
> h3. {color:#de350b}Given:{color}
> The table6 is requested to return the data even though it is not being used 
> and it is CPU and IO consuming operation.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43999) Data is still fetched even though result was returned

2023-06-26 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17737026#comment-17737026
 ] 

Jia Fan commented on SPARK-43999:
-

The reason is that when AQE is on, the stage for the small empty table always 
finishes first when joining the two tables. Once the left side's result is 
available, the AQE optimizer uses its statistics to re-optimize the plan, so the 
right table's result becomes unnecessary. The result is returned, but the query 
stage for the right table is currently neither cancelled nor forced to finish.
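
A small sketch of how to observe this from the driver (it assumes the tab1/tab6 
temporary views from the repro quoted below are already registered):

{code:scala}
// Sketch only: with AQE on, the final adaptive plan no longer needs tab6 once
// tab1's stage reports zero rows, yet the already-submitted tab6 scan keeps running.
spark.conf.set("spark.sql.adaptive.enabled", "true")

val q = spark.sql("""
  SELECT t1.Id, t1.Name, t6.Name
  FROM tab1 AS t1
  LEFT OUTER JOIN tab6 AS t6 ON t6.Id = t1.Id
""")
q.collect()                              // returns an empty result almost immediately
println(q.queryExecution.executedPlan)   // check whether the tab6 scan is still referenced
{code}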

> Data is still fetched even though result was returned
> -
>
> Key: SPARK-43999
> URL: https://issues.apache.org/jira/browse/SPARK-43999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
> Environment: Production
>Reporter: Kamil Kliczbor
>Priority: Major
> Attachments: Profiler.PNG
>
>
> h2. Short problem description:
> I have two tables:
>  * tab1 is empty
>  * tab6 has milions of records
>  * when Spark returns results due to empty database table tab1, it still asks 
> for the tab6 data
> When I create the query that uses LEFT JOIN, the results are returned 
> immediately, however under the hood the huge table is requested to return the 
> results anyway.
> h2. Repro:
> h3. Prepare the MSSQL server database 
> 1. Install SQLExpress (in my case MSSQL2012, but can be any version) as a 
> named instance SQL2012.
> 2. Download and install 
> [SSMS|https://learn.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms?view=sql-server-ver16]
>  (or any other tool) and run the following Query
> {code:sql}
> USE [master]
> GO
> CREATE DATABASE QueueSlots
> GO
> CREATE LOGIN [spark] WITH PASSWORD=N'spark', DEFAULT_DATABASE=[master], 
> CHECK_EXPIRATION=OFF, CHECK_POLICY=OFF
> GO
> USE [QueueSlots]
> GO
> CREATE USER [spark] FOR LOGIN [spark] WITH DEFAULT_SCHEMA=[dbo]
> GO
> {code}
> 3. Then create the tables and fill the tab6 with the data:
> {code:sql}
> CREATE TABLE tab1 (Id INT, Name NVARCHAR(50))
> CREATE TABLE tab6 (Id INT, Name NVARCHAR(50))
> insert into tab6
> select o1.object_id as Id , o1.name as Name
> from sys.objects as o1 
> cross join sys.objects as o2
> cross join sys.objects as o3
> cross join sys.objects as o4
> -- it might be required to increase the numer of the cross joins to increase 
> the number of the rows, approximately 1 mln is enough - select should take 
> several seconds
> {code}
> h3. Prepare Spark
>  # Download mssql jdbc driver in version 12.2.0
>  # Run spark-shell2.cmd with the settings -cp 
> "/lib/sqljdbc/12.2/mssql-jdbc-12.2.0.jre8.jar"
> h3. Create temporary views on Spark
> {code:java}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> sqlContext.sql("""
> CREATE TEMPORARY VIEW tab1
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   driver 'com.microsoft.sqlserver.jdbc.SQLServerDriver',
>   url 
> 'jdbc:sqlserver://;serverName=localhost;instanceName=sql2012;databaseName=QueueSlots;encrypt=true;trustServerCertificate=true;',
>   
>   dbtable 'dbo.Tab1',
>   user 'spark',
>   password 'spark'
> )
> """)
> sqlContext.sql("""
> CREATE TEMPORARY VIEW tab6
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   driver 'com.microsoft.sqlserver.jdbc.SQLServerDriver',
>   url 
> 'jdbc:sqlserver://;serverName=localhost;instanceName=sql2012;databaseName=QueueSlots;encrypt=true;trustServerCertificate=true;',
>   
>   dbtable 'dbo.Tab6',
>   user 'spark',
>   password 'spark'
> )
> """)
> {code}
> h3. Enable SQL Server Profiler tracing
>  # Go to SSMS and open Sql Server Profiler (Tools -> Sql Server Profiler). 
> Create new trace to the "QueueSlots" database. Use filtering options to see 
> only queries issued for that database (Events Selection tab -> check Show all 
> events and Show all columns, then click Column Filters -> DatabaseName like 
> QueueSlots).
>  # Run the trace
> h3. Run the query in Spark console
>  # Run the following query
> {code:java}
> sqlContext.sql("""
> SELECT t1.Id, t1.Name, t6.Name
>   FROM tab1 as t1
>   LEFT OUTER JOIN tab6 AS t6 ON t6.Id = t1.Id
> """).show
> {code}
> The results are returned immediately as:
> {code:java}
> +---+++
> | Id|Name|Name|
> +---+++
> +---+++
> [Stage 63:> (0 + 1) / 
> 1]
> {code}
> h3. {color:#00875a}Expected{color}
> As the results are returned immediately for empty table, another sources are 
> not queried.
> h3. {color:#de350b}Given:{color}
> The table6 is requested to return the data even though it is not being used 
> and it is CPU and IO consuming operation.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-44188) Remove useless code `resetAllPartitions` in ActiveJob

2023-06-26 Thread Jia Fan (Jira)
Jia Fan created SPARK-44188:
---

 Summary: Remove useless code `resetAllPartitions` in ActiveJob
 Key: SPARK-44188
 URL: https://issues.apache.org/jira/browse/SPARK-44188
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.1
Reporter: Jia Fan


The class ActiveJob has an unused method, `resetAllPartitions`. We should remove 
it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43201) Inconsistency between from_avro and from_json function

2023-06-15 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733281#comment-17733281
 ] 

Jia Fan commented on SPARK-43201:
-

If avroSchema1 does not equal avroSchema2, the DataFrame's schema would not match 
across rows. That would be a problem.

> Inconsistency between from_avro and from_json function
> --
>
> Key: SPARK-43201
> URL: https://issues.apache.org/jira/browse/SPARK-43201
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Philip Adetiloye
>Priority: Major
>
> Spark from_avro function does not allow schema parameter to use dataframe 
> column but takes only a String schema:
> {code:java}
> def from_avro(col: Column, jsonFormatSchema: String): Column {code}
> This makes it impossible to deserialize rows of Avro records with different 
> schema since only one schema string could be pass externally. 
>  
> Here is what I would expect like from_json function:
> {code:java}
> def from_avro(col: Column, jsonFormatSchema: Column): Column  {code}
> code example:
> {code:java}
> import org.apache.spark.sql.functions.from_avro
> val avroSchema1 = 
> """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""
>  
> val avroSchema2 = 
> """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""
> val df = Seq(
>   (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
>   (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
> ).toDF("binaryData", "schema")
> val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))
> parsed.show()
> // Output:
> // ++
> // |  parsedData|
> // ++
> // |[apple1, 1.0]|
> // |[apple2, 2.0]|
> // ++
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43486) number of files read is incorrect if it is bucket table

2023-06-13 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17732320#comment-17732320
 ] 

Jia Fan commented on SPARK-43486:
-

I couldn't reproduce it either. :(

> number of files read is incorrect if it is bucket table
> ---
>
> Key: SPARK-43486
> URL: https://issues.apache.org/jira/browse/SPARK-43486
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: screenshot-1.png
>
>
>  !screenshot-1.png! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43486) number of files read is incorrect if it is bucket table

2023-06-13 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17732045#comment-17732045
 ] 

Jia Fan commented on SPARK-43486:
-

Hi [~panbingkun], any update on this?

> number of files read is incorrect if it is bucket table
> ---
>
> Key: SPARK-43486
> URL: https://issues.apache.org/jira/browse/SPARK-43486
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: screenshot-1.png
>
>
>  !screenshot-1.png! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43891) Support SHOW VIEWS IN . when not is not the current selected catalog

2023-06-13 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17732031#comment-17732031
 ] 

Jia Fan commented on SPARK-43891:
-

cc [~cloud_fan] [~dongjoon] 

> Support SHOW VIEWS IN . when not  is not the 
> current selected catalog
> ---
>
> Key: SPARK-43891
> URL: https://issues.apache.org/jira/browse/SPARK-43891
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Rui Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-43891) Support SHOW VIEWS IN . when not is not the current selected catalog

2023-06-13 Thread Jia Fan (Jira)


[ https://issues.apache.org/jira/browse/SPARK-43891 ]


Jia Fan deleted comment on SPARK-43891:
-

was (Author: fanjia):
I can work for this.

> Support SHOW VIEWS IN . when not  is not the 
> current selected catalog
> ---
>
> Key: SPARK-43891
> URL: https://issues.apache.org/jira/browse/SPARK-43891
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Rui Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43891) Support SHOW VIEWS IN . when not is not the current selected catalog

2023-06-13 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17732030#comment-17732030
 ] 

Jia Fan commented on SPARK-43891:
-

Hi [~amaliujia], I have a question about views. I see that Spark added a 
ViewCatalog for DataSourceV2, but we never use it (views can't be created through 
the ViewCatalog at the moment). As I understand it, this ticket would be 
implemented on DataSourceV2 so that views in a different catalog can be listed. 
But since we don't support creating views there, what is the point of showing 
them?

> Support SHOW VIEWS IN . when not  is not the 
> current selected catalog
> ---
>
> Key: SPARK-43891
> URL: https://issues.apache.org/jira/browse/SPARK-43891
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Rui Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43753) Incorrect result of MINUS in spark sql.

2023-06-13 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17732012#comment-17732012
 ] 

Jia Fan commented on SPARK-43753:
-

This seems to be already fixed on the master branch.

> Incorrect result of MINUS in spark sql.
> ---
>
> Key: SPARK-43753
> URL: https://issues.apache.org/jira/browse/SPARK-43753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.3, 3.1.3
>Reporter: Kernel Force
>Priority: Major
>
> sql("""
> with va as (
>   select '123' id, 'a' name
>    union all
>   select '123' id, 'b' name
> )
> select '123' id, 'a' name from va t where t.name = 'a'
>  minus 
> select '123' id, 'a' name from va s where s.name = 'b'
> """).show
> +---++
> | id|name|
> +---++
> |123|   a|
> +---++
> which is expected to be empty result set.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43891) Support SHOW VIEWS IN . when not is not the current selected catalog

2023-06-13 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17732006#comment-17732006
 ] 

Jia Fan commented on SPARK-43891:
-

I can work for this.

> Support SHOW VIEWS IN . when not  is not the 
> current selected catalog
> ---
>
> Key: SPARK-43891
> URL: https://issues.apache.org/jira/browse/SPARK-43891
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Rui Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43781) IllegalStateException when cogrouping two datasets derived from the same source

2023-06-12 Thread Jia Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jia Fan updated SPARK-43781:

Affects Version/s: 3.4.0

> IllegalStateException when cogrouping two datasets derived from the same 
> source
> ---
>
> Key: SPARK-43781
> URL: https://issues.apache.org/jira/browse/SPARK-43781
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1, 3.4.0
> Environment: Reproduces in a unit test, using Spark 3.3.1, the Java 
> API, and a {{local[2]}} SparkSession.
>Reporter: Derek Murray
>Priority: Major
>
> Attempting to {{cogroup}} two datasets derived from the same source dataset 
> yields an {{IllegalStateException}} when the query is executed.
> Minimal reproducer:
> {code:java}
> StructType inputType = DataTypes.createStructType(
> new StructField[]{
> DataTypes.createStructField("id", DataTypes.LongType, false),
> DataTypes.createStructField("type", DataTypes.StringType, false)
> }
> );
> StructType keyType = DataTypes.createStructType(
> new StructField[]{
> DataTypes.createStructField("id", DataTypes.LongType, false)
> }
> );
> List inputRows = new ArrayList<>();
> inputRows.add(RowFactory.create(1L, "foo"));
> inputRows.add(RowFactory.create(1L, "bar"));
> inputRows.add(RowFactory.create(2L, "foo"));
> Dataset input = sparkSession.createDataFrame(inputRows, inputType);
> KeyValueGroupedDataset fooGroups = input
> .filter("type = 'foo'")
> .groupBy("id")
> .as(RowEncoder.apply(keyType), RowEncoder.apply(inputType));
> KeyValueGroupedDataset barGroups = input
> .filter("type = 'bar'")
> .groupBy("id")
> .as(RowEncoder.apply(keyType), RowEncoder.apply(inputType));
> Dataset result = fooGroups.cogroup(
> barGroups,
> (CoGroupFunction) (row, iterator, iterator1) -> new 
> ArrayList().iterator(),
> RowEncoder.apply(inputType));
> result.explain();
> result.show();{code}
> Explain output (note mismatch in column IDs between Sort/Exchagne and 
> LocalTableScan on the first input to the CoGroup):
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- SerializeFromObject 
> [validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, id), LongType, false) AS id#37L, 
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 1, type), StringType, false), true, false, 
> true) AS type#38]
>    +- CoGroup 
> org.apache.spark.sql.KeyValueGroupedDataset$$Lambda$1478/1869116781@77856cc5, 
> createexternalrow(id#16L, StructField(id,LongType,false)), 
> createexternalrow(id#16L, type#17.toString, StructField(id,LongType,false), 
> StructField(type,StringType,false)), createexternalrow(id#16L, 
> type#17.toString, StructField(id,LongType,false), 
> StructField(type,StringType,false)), [id#39L], [id#39L], [id#39L, type#40], 
> [id#39L, type#40], obj#36: org.apache.spark.sql.Row
>       :- !Sort [id#39L ASC NULLS FIRST], false, 0
>       :  +- !Exchange hashpartitioning(id#39L, 2), ENSURE_REQUIREMENTS, 
> [plan_id=19]
>       :     +- LocalTableScan [id#16L, type#17]
>       +- Sort [id#39L ASC NULLS FIRST], false, 0
>          +- Exchange hashpartitioning(id#39L, 2), ENSURE_REQUIREMENTS, 
> [plan_id=20]
>             +- LocalTableScan [id#39L, type#40]{code}
> Exception:
> {code:java}
> java.lang.IllegalStateException: Couldn't find id#39L in [id#16L,type#17]
>         at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
>         at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
>         at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589)
>         at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:75)
>         at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:35)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589)
>         at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589)
>         at 
> 

[jira] [Commented] (SPARK-42290) Spark Driver hangs on OOM during Broadcast when AQE is enabled

2023-06-11 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17731331#comment-17731331
 ] 

Jia Fan commented on SPARK-42290:
-

Thanks [~dongjoon] 

> Spark Driver hangs on OOM during Broadcast when AQE is enabled 
> ---
>
> Key: SPARK-42290
> URL: https://issues.apache.org/jira/browse/SPARK-42290
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Shardul Mahadik
>Assignee: Jia Fan
>Priority: Critical
> Fix For: 3.4.1, 3.5.0
>
>
> Repro steps:
> {code}
> $ spark-shell --conf spark.driver.memory=1g
> val df = spark.range(500).withColumn("str", 
> lit("abcdabcdabcdabcdabasgasdfsadfasdfasdfasfasfsadfasdfsadfasdf"))
> val df2 = spark.range(10).join(broadcast(df), Seq("id"), "left_outer")
> df2.collect
> {code}
> This will cause the driver to hang indefinitely. Heres a thread dump of the 
> {{main}} thread when its stuck
> {code}
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:285)
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$$Lambda$2819/629294880.apply(Unknown
>  Source)
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:809)
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:236)
>  => holding Monitor(java.lang.Object@1932537396})
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:381)
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:354)
> org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:4179)
> org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:3420)
> org.apache.spark.sql.Dataset$$Lambda$2390/1803372144.apply(Unknown Source)
> org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4169)
> org.apache.spark.sql.Dataset$$Lambda$2791/1357377136.apply(Unknown Source)
> org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:526)
> org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4167)
> org.apache.spark.sql.Dataset$$Lambda$2391/1172042998.apply(Unknown Source)
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118)
> org.apache.spark.sql.execution.SQLExecution$$$Lambda$2402/721269425.apply(Unknown
>  Source)
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
> org.apache.spark.sql.execution.SQLExecution$$$Lambda$2392/11632488.apply(Unknown
>  Source)
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:809)
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
> org.apache.spark.sql.Dataset.withAction(Dataset.scala:4167)
> org.apache.spark.sql.Dataset.collect(Dataset.scala:3420)
> {code}
> When we disable AQE though we get the following exception instead of driver 
> hang.
> {code}
> Caused by: org.apache.spark.SparkException: Not enough memory to build and 
> broadcast the table to all worker nodes. As a workaround, you can either 
> disable broadcast by setting spark.sql.autoBroadcastJoinThreshold to -1 or 
> increase the spark driver memory by setting spark.driver.memory to a higher 
> value.
>   ... 7 more
> Caused by: java.lang.OutOfMemoryError: Java heap space
>   at 
> org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.grow(HashedRelation.scala:834)
>   at 
> org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.append(HashedRelation.scala:777)
>   at 
> org.apache.spark.sql.execution.joins.LongHashedRelation$.apply(HashedRelation.scala:1086)
>   at 
> org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:157)
>   at 
> org.apache.spark.sql.execution.joins.HashedRelationBroadcastMode.transform(HashedRelation.scala:1163)
>   at 
> org.apache.spark.sql.execution.joins.HashedRelationBroadcastMode.transform(HashedRelation.scala:1151)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:148)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$Lambda$2999/145945436.apply(Unknown
>  Source)
>   at 
> 

[jira] [Updated] (SPARK-40637) Spark-shell can correctly encode BINARY type but Spark-sql cannot

2023-06-08 Thread Jia Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jia Fan updated SPARK-40637:

Affects Version/s: 3.4.0

> Spark-shell can correctly encode BINARY type but Spark-sql cannot
> -
>
> Key: SPARK-40637
> URL: https://issues.apache.org/jira/browse/SPARK-40637
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1, 3.4.0
>Reporter: xsys
>Priority: Major
> Attachments: image-2022-10-18-12-15-05-576.png
>
>
> h3. Describe the bug
> When we store a BINARY value (e.g. {{BigInt("1").toByteArray)}} / 
> {{{}X'01'{}}}) either via {{spark-shell or spark-sql, and then read it from 
> Spark-shell, it}} outputs {{{}[01]{}}}. However, it does not encode correctly 
> when querying it via {{{}spark-sql{}}}.
> i.e.,
> Insert via spark-shell, read via spark-shell: display correctly
> Insert via spark-shell, read via spark-sql: does not display correctly
> Insert via spark-sql, read via spark-sql: does not display correctly
> Insert via spark-sql, read via spark-shell: display correctly
> h3. To Reproduce
> On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{{}spark-shell{}}}:
> {code:java}
> $SPARK_HOME/bin/spark-shell{code}
> Execute the following:
> {code:java}
> scala> import org.apache.spark.sql.Row 
> scala> import org.apache.spark.sql.types._
> scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> ParallelCollectionRDD[356] at parallelize at :28
> scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(c1,BinaryType,true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: binary]
> scala> 
> df.write.mode("overwrite").format("orc").saveAsTable("binary_vals_shell")
> scala> spark.sql("select * from binary_vals_shell;").show(false)
> ++
> |c1  |
> ++
> |[01]|
> ++{code}
> Then using {{spark-sql}} to (1) query what is inserted via spark-shell to the 
> binary_vals_shell table, and then (2) insert the value via spark-sql to the 
> binary_vals_sql table (we use tee to redirect the log to a file)
> {code:java}
> $SPARK_HOME/bin/spark-sql | tee sql.log{code}
>  Execute the following, we only get an empty output in the terminal (but a 
> garbage character in the log file):
> {code:java}
> spark-sql> select * from binary_vals_shell; -- query what is inserted via 
> spark-shell;
> spark-sql> create table binary_vals_sql(c1 BINARY) stored as ORC; 
> spark-sql> insert into binary_vals_sql select X'01'; -- try to insert 
> directly in spark-sql;
> spark-sql> select * from binary_vals_sql;
> Time taken: 0.077 seconds, Fetched 1 row(s)
> {code}
> From the log file, we find it shows as a garbage character. (We never 
> encountered this garbage character in logs of other data types)
> h3. !image-2022-10-18-12-15-05-576.png!
> We then return to spark-shell again and run the following:
> {code:java}
> scala> spark.sql("select * from binary_vals_sql;").show(false)
> ++                                                                        
>   
> |c1  |
> ++
> |[01]|
> ++{code}
> The binary value does not display correctly via spark-sql, it still displays 
> correctly via spark-shell.
> h3. Expected behavior
> We expect the two Spark interfaces ({{{}spark-sql{}}} & {{{}spark-shell{}}}) 
> to behave consistently for the same data type ({{{}BINARY{}}}) & input 
> ({{{}BigInt("1").toByteArray){}}} / {{{}X'01'{}}}) combination.
>  
> h3. Additional context
> We also tried Avro and Parquet and encountered the same issue. We believe 
> this is format-independent.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42290) Spark Driver hangs on OOM during Broadcast when AQE is enabled

2023-06-08 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730775#comment-17730775
 ] 

Jia Fan commented on SPARK-42290:
-

[~dongjoon] It seems the assignee is not right.

> Spark Driver hangs on OOM during Broadcast when AQE is enabled 
> ---
>
> Key: SPARK-42290
> URL: https://issues.apache.org/jira/browse/SPARK-42290
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Shardul Mahadik
>Assignee: Shardul Mahadik
>Priority: Critical
> Fix For: 3.4.1, 3.5.0
>
>
> Repro steps:
> {code}
> $ spark-shell --conf spark.driver.memory=1g
> val df = spark.range(500).withColumn("str", 
> lit("abcdabcdabcdabcdabasgasdfsadfasdfasdfasfasfsadfasdfsadfasdf"))
> val df2 = spark.range(10).join(broadcast(df), Seq("id"), "left_outer")
> df2.collect
> {code}
> This will cause the driver to hang indefinitely. Heres a thread dump of the 
> {{main}} thread when its stuck
> {code}
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:285)
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$$Lambda$2819/629294880.apply(Unknown
>  Source)
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:809)
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:236)
>  => holding Monitor(java.lang.Object@1932537396})
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:381)
> org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:354)
> org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:4179)
> org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:3420)
> org.apache.spark.sql.Dataset$$Lambda$2390/1803372144.apply(Unknown Source)
> org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4169)
> org.apache.spark.sql.Dataset$$Lambda$2791/1357377136.apply(Unknown Source)
> org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:526)
> org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4167)
> org.apache.spark.sql.Dataset$$Lambda$2391/1172042998.apply(Unknown Source)
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118)
> org.apache.spark.sql.execution.SQLExecution$$$Lambda$2402/721269425.apply(Unknown
>  Source)
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
> org.apache.spark.sql.execution.SQLExecution$$$Lambda$2392/11632488.apply(Unknown
>  Source)
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:809)
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
> org.apache.spark.sql.Dataset.withAction(Dataset.scala:4167)
> org.apache.spark.sql.Dataset.collect(Dataset.scala:3420)
> {code}
> When we disable AQE though we get the following exception instead of driver 
> hang.
> {code}
> Caused by: org.apache.spark.SparkException: Not enough memory to build and 
> broadcast the table to all worker nodes. As a workaround, you can either 
> disable broadcast by setting spark.sql.autoBroadcastJoinThreshold to -1 or 
> increase the spark driver memory by setting spark.driver.memory to a higher 
> value.
>   ... 7 more
> Caused by: java.lang.OutOfMemoryError: Java heap space
>   at 
> org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.grow(HashedRelation.scala:834)
>   at 
> org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.append(HashedRelation.scala:777)
>   at 
> org.apache.spark.sql.execution.joins.LongHashedRelation$.apply(HashedRelation.scala:1086)
>   at 
> org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:157)
>   at 
> org.apache.spark.sql.execution.joins.HashedRelationBroadcastMode.transform(HashedRelation.scala:1163)
>   at 
> org.apache.spark.sql.execution.joins.HashedRelationBroadcastMode.transform(HashedRelation.scala:1151)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:148)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$Lambda$2999/145945436.apply(Unknown
>  Source)
>   at 
> 
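
For reference, the two workarounds named in the quoted error message map onto existing configuration knobs; a minimal sketch, assuming an active SparkSession named `spark`:

{code:java}
// Disable automatic broadcast joins for the current session, so the build side
// is never built and broadcast from the driver:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

// Or give the driver more heap up front (must be set before the JVM starts):
//   spark-shell --conf spark.driver.memory=4g
{code}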

[jira] [Commented] (SPARK-43203) Fix DROP table behavior in session catalog

2023-05-30 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727795#comment-17727795
 ] 

Jia Fan commented on SPARK-43203:
-

https://github.com/apache/spark/pull/41348

> Fix DROP table behavior in session catalog
> --
>
> Key: SPARK-43203
> URL: https://issues.apache.org/jira/browse/SPARK-43203
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> DROP table behavior is not working correctly in 3.4.0 because we always 
> invoke V1 drop logic if the identifier looks like a V1 identifier. This is a 
> big blocker for external data sources that provide custom session catalogs.
> See [here|https://github.com/apache/spark/pull/37879/files#r1170501180] for 
> details.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43521) Support CREATE TABLE LIKE FILE for PARQUET

2023-05-30 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727793#comment-17727793
 ] 

Jia Fan commented on SPARK-43521:
-

https://github.com/apache/spark/pull/41251

> Support CREATE TABLE LIKE FILE for PARQUET
> --
>
> Key: SPARK-43521
> URL: https://issues.apache.org/jira/browse/SPARK-43521
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: melin
>Priority: Major
>
> ref: https://issues.apache.org/jira/browse/HIVE-26395
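
For context, a rough sketch of the statement this would enable, modeled on the Hive proposal linked above; the exact Spark grammar would be settled in the PR, and the table name and path below are placeholders:

{code:java}
// Hypothetical syntax sketch (not yet valid Spark SQL): derive the table schema
// from an existing Parquet file instead of spelling out the columns.
spark.sql("CREATE TABLE events LIKE FILE PARQUET '/data/events/part-00000.parquet'")
{code}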



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43838) Subquery on single table with having clause can't be optimized

2023-05-30 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727794#comment-17727794
 ] 

Jia Fan commented on SPARK-43838:
-

https://github.com/apache/spark/pull/41347

> Subquery on single table with having clause can't be optimized
> --
>
> Key: SPARK-43838
> URL: https://issues.apache.org/jira/browse/SPARK-43838
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jia Fan
>Priority: Major
>
> Eg:
> {code:java}
> sql("create view t(c1, c2) as values (0, 1), (0, 2), (1, 2)")
> sql("select c1, c2, (select count(*) cnt from t t2 where t1.c1 = t2.c1 " +
> "having cnt = 0) from t t1").show() {code}
> The error will throw:
> {code:java}
> [PLAN_VALIDATION_FAILED_RULE_IN_BATCH] Rule 
> org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery in 
> batch Operator Optimization before Inferring Filters generated an invalid 
> plan: The plan becomes unresolved: 'Project [toprettystring(c1#224, 
> Some(America/Los_Angeles)) AS toprettystring(c1)#238, toprettystring(c2#225, 
> Some(America/Los_Angeles)) AS toprettystring(c2)#239, 
> toprettystring(cnt#246L, Some(America/Los_Angeles)) AS 
> toprettystring(scalarsubquery(c1))#240]
> +- 'Project [c1#224, c2#225, CASE WHEN isnull(alwaysTrue#245) THEN 0 WHEN NOT 
> (cnt#222L = 0) THEN null ELSE cnt#222L END AS cnt#246L]
>    +- 'Join LeftOuter, (c1#224 = c1#224#244)
>       :- Project [col1#226 AS c1#224, col2#227 AS c2#225]
>       :  +- LocalRelation [col1#226, col2#227]
>       +- Project [cnt#222L, c1#224#244, cnt#222L, c1#224, true AS 
> alwaysTrue#245]
>          +- Project [cnt#222L, c1#224 AS c1#224#244, cnt#222L, c1#224]
>             +- Aggregate [c1#224], [count(1) AS cnt#222L, c1#224]
>                +- Project [col1#228 AS c1#224]
>                   +- LocalRelation [col1#228, col2#229]The previous plan: 
> Project [toprettystring(c1#224, Some(America/Los_Angeles)) AS 
> toprettystring(c1)#238, toprettystring(c2#225, Some(America/Los_Angeles)) AS 
> toprettystring(c2)#239, toprettystring(scalar-subquery#223 [c1#224 && (c1#224 
> = c1#224#244)], Some(America/Los_Angeles)) AS 
> toprettystring(scalarsubquery(c1))#240]
> :  +- Project [cnt#222L, c1#224 AS c1#224#244]
> :     +- Filter (cnt#222L = 0)
> :        +- Aggregate [c1#224], [count(1) AS cnt#222L, c1#224]
> :           +- Project [col1#228 AS c1#224]
> :              +- LocalRelation [col1#228, col2#229]
> +- Project [col1#226 AS c1#224, col2#227 AS c2#225]
>    +- LocalRelation [col1#226, col2#227] {code}
>  
> The reason is that during subquery decorrelation, the fields that appear in the 
> subquery but not in the HAVING clause are wrongly pulled up. This problem only 
> occurs when a HAVING clause is present.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43882) Assign a name to the error class _LEGACY_ERROR_TEMP_2122

2023-05-30 Thread Jia Fan (Jira)
Jia Fan created SPARK-43882:
---

 Summary: Assign a name to the error class _LEGACY_ERROR_TEMP_2122
 Key: SPARK-43882
 URL: https://issues.apache.org/jira/browse/SPARK-43882
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Jia Fan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-43805) Support SELECT * EXCEPT AND SELECT * REPLACE

2023-05-28 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17726627#comment-17726627
 ] 

Jia Fan edited comment on SPARK-43805 at 5/29/23 12:55 AM:
---

To tell the truth, I don't know whether Spark will accept this statement; it 
doesn't look like standard SQL. cc [~cloud_fan] [~dongjoon] 


was (Author: fanjia):
Tell the truth, I'm don't know will spark accept this statement? It doesn't 
look like standard sql. cc [~cloud_fan] [~dongjoon] 

> Support SELECT * EXCEPT AND  SELECT * REPLACE
> -
>
> Key: SPARK-43805
> URL: https://issues.apache.org/jira/browse/SPARK-43805
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: melin
>Priority: Major
>
> ref: 
> [https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#select_except]
> https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#select_replace
> [~fanjia] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43838) Subquery on single table with having clause can't be optimized

2023-05-27 Thread Jia Fan (Jira)
Jia Fan created SPARK-43838:
---

 Summary: Subquery on single table with having clause can't be 
optimized
 Key: SPARK-43838
 URL: https://issues.apache.org/jira/browse/SPARK-43838
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Jia Fan


Eg:
{code:java}
sql("create view t(c1, c2) as values (0, 1), (0, 2), (1, 2)")

sql("select c1, c2, (select count(*) cnt from t t2 where t1.c1 = t2.c1 " +
"having cnt = 0) from t t1").show() {code}
The error will throw:
{code:java}
[PLAN_VALIDATION_FAILED_RULE_IN_BATCH] Rule 
org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery in 
batch Operator Optimization before Inferring Filters generated an invalid plan: 
The plan becomes unresolved: 'Project [toprettystring(c1#224, 
Some(America/Los_Angeles)) AS toprettystring(c1)#238, toprettystring(c2#225, 
Some(America/Los_Angeles)) AS toprettystring(c2)#239, toprettystring(cnt#246L, 
Some(America/Los_Angeles)) AS toprettystring(scalarsubquery(c1))#240]
+- 'Project [c1#224, c2#225, CASE WHEN isnull(alwaysTrue#245) THEN 0 WHEN NOT 
(cnt#222L = 0) THEN null ELSE cnt#222L END AS cnt#246L]
   +- 'Join LeftOuter, (c1#224 = c1#224#244)
      :- Project [col1#226 AS c1#224, col2#227 AS c2#225]
      :  +- LocalRelation [col1#226, col2#227]
      +- Project [cnt#222L, c1#224#244, cnt#222L, c1#224, true AS 
alwaysTrue#245]
         +- Project [cnt#222L, c1#224 AS c1#224#244, cnt#222L, c1#224]
            +- Aggregate [c1#224], [count(1) AS cnt#222L, c1#224]
               +- Project [col1#228 AS c1#224]
                  +- LocalRelation [col1#228, col2#229]The previous plan: 
Project [toprettystring(c1#224, Some(America/Los_Angeles)) AS 
toprettystring(c1)#238, toprettystring(c2#225, Some(America/Los_Angeles)) AS 
toprettystring(c2)#239, toprettystring(scalar-subquery#223 [c1#224 && (c1#224 = 
c1#224#244)], Some(America/Los_Angeles)) AS 
toprettystring(scalarsubquery(c1))#240]
:  +- Project [cnt#222L, c1#224 AS c1#224#244]
:     +- Filter (cnt#222L = 0)
:        +- Aggregate [c1#224], [count(1) AS cnt#222L, c1#224]
:           +- Project [col1#228 AS c1#224]
:              +- LocalRelation [col1#228, col2#229]
+- Project [col1#226 AS c1#224, col2#227 AS c2#225]
   +- LocalRelation [col1#226, col2#227] {code}
 

The reason is that during subquery decorrelation, the fields that appear in the 
subquery but not in the HAVING clause are wrongly pulled up. This problem only 
occurs when a HAVING clause is present.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43805) Support SELECT * EXCEPT AND SELECT * REPLACE

2023-05-26 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17726627#comment-17726627
 ] 

Jia Fan commented on SPARK-43805:
-

Tell the truth, I'm don't know will spark accept this statement? It doesn't 
look like standard sql. cc [~cloud_fan] [~dongjoon] 

> Support SELECT * EXCEPT AND  SELECT * REPLACE
> -
>
> Key: SPARK-43805
> URL: https://issues.apache.org/jira/browse/SPARK-43805
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: melin
>Priority: Major
>
> ref: 
> [https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#select_except]
> https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#select_replace
> [~fanjia] 
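
For context, the BigQuery forms being requested and the closest DataFrame-API equivalents that already work; the `orders`/`order_id`/`quantity` names below are placeholders, and `df` is assumed to be a DataFrame with those columns:

{code:java}
import org.apache.spark.sql.functions.col

// BigQuery syntax from the linked docs (does NOT parse in Spark at the moment):
//   SELECT * EXCEPT (order_id) FROM orders
//   SELECT * REPLACE (quantity * 2 AS quantity) FROM orders

// Rough Spark equivalents today:
val withoutId = df.drop("order_id")                            // ~ SELECT * EXCEPT
val doubled   = df.withColumn("quantity", col("quantity") * 2) // ~ SELECT * REPLACE
{code}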



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43338) Support modify the SESSION_CATALOG_NAME value

2023-05-22 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724939#comment-17724939
 ] 

Jia Fan commented on SPARK-43338:
-

`Assign each hms a unique catalogname only so that the meta tableId is unique: 
catalog.database.table.`

I think DataSource V2 can do that, but I didn't verify it.

> Support  modify the SESSION_CATALOG_NAME value
> --
>
> Key: SPARK-43338
> URL: https://issues.apache.org/jira/browse/SPARK-43338
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: melin
>Priority: Major
>
> {code:java}
> private[sql] object CatalogManager {
> val SESSION_CATALOG_NAME: String = "spark_catalog"
> }{code}
>  
> The SESSION_CATALOG_NAME value cannot be modified.
> If multiple Hive Metastores exist, the platform manages multiple hms metadata 
> and classifies them by catalogName. A different catalog name is required.
> [~fanjia] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37351) Supports write data flow control

2023-05-22 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724934#comment-17724934
 ] 

Jia Fan commented on SPARK-37351:
-

Do you want data flow control in micro-batch or batch mode? This is a big 
feature; it should probably go through an SPIP sent to the dev mailing list, and 
I'm not sure the community will accept this change. cc [~cloud_fan]
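
As a partial workaround today (not true flow control), the JDBC writer's existing options can at least bound write parallelism and batch size; a sketch, assuming a DataFrame `df` and a placeholder JDBC URL and table name:

{code:java}
// Existing JDBC data source options, used here to throttle writes indirectly:
// numPartitions caps concurrent write tasks (Spark coalesces down to it) and
// batchsize caps the rows sent per JDBC batch insert.
df.write
  .format("jdbc")
  .option("url", "jdbc:mysql://db-host:3306/prod")   // placeholder
  .option("dbtable", "target_table")                 // placeholder
  .option("numPartitions", "4")
  .option("batchsize", "500")
  .mode("append")
  .save()
{code}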

> Supports write data flow control
> 
>
> Key: SPARK-37351
> URL: https://issues.apache.org/jira/browse/SPARK-37351
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: melin
>Priority: Major
>
> The hive table data is written to a relational database, generally an online 
> production database. If the writing speed has no traffic control, it can 
> easily affect the stability of the online system. It is recommended to add 
> traffic control parameters
> [~fanjia] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43338) Support modify the SESSION_CATALOG_NAME value

2023-05-22 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17724927#comment-17724927
 ] 

Jia Fan commented on SPARK-43338:
-

`If the same hive database has parquet and hudi table, does HiveTableCatalog 
support access to hudi table?`

Have you tried this? I'm not sure this change is necessary; `Session_Catalog` is 
just one special catalog for Spark.

`If multiple Hive Metastores exist, the platform manages multiple hms metadata 
and classifies them by catalogName.`

This requirement seems to match DataSource V2 very well, which handles different 
catalogs in Spark.
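
A minimal sketch of that direction, assuming hypothetical catalog plugin classes (`com.example.HmsACatalog`, `com.example.HmsBCatalog`) that implement the DataSource V2 catalog API; `spark_catalog` itself is left untouched:

{code:java}
import org.apache.spark.sql.SparkSession

// Register extra catalogs under their own names via spark.sql.catalog.<name>.
// The implementation classes are placeholders, not something shipped with Spark.
val spark = SparkSession.builder()
  .appName("multi-hms")
  .config("spark.sql.catalog.hms_a", "com.example.HmsACatalog")
  .config("spark.sql.catalog.hms_b", "com.example.HmsBCatalog")
  .getOrCreate()

// Tables are then addressed as catalog.database.table:
spark.sql("SELECT * FROM hms_a.sales.orders").show()
spark.sql("SELECT * FROM hms_b.sales.orders").show()
{code}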

> Support  modify the SESSION_CATALOG_NAME value
> --
>
> Key: SPARK-43338
> URL: https://issues.apache.org/jira/browse/SPARK-43338
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: melin
>Priority: Major
>
> {code:java}
> private[sql] object CatalogManager {
> val SESSION_CATALOG_NAME: String = "spark_catalog"
> }{code}
>  
> The SESSION_CATALOG_NAME value cannot be modified.
> If multiple Hive Metastores exist, the platform manages multiple hms metadata 
> and classifies them by catalogName. A different catalog name is required.
> [~fanjia] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-43521) Support CREATE TABLE LIKE FILE for PARQUET

2023-05-18 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17723900#comment-17723900
 ] 

Jia Fan edited comment on SPARK-43521 at 5/19/23 1:31 AM:
--

I'm working on this!


was (Author: fanjia):
I'm working for this!

> Support CREATE TABLE LIKE FILE for PARQUET
> --
>
> Key: SPARK-43521
> URL: https://issues.apache.org/jira/browse/SPARK-43521
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: melin
>Priority: Major
>
> ref: https://issues.apache.org/jira/browse/HIVE-26395



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43521) Support CREATE TABLE LIKE FILE for PARQUET

2023-05-18 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17723900#comment-17723900
 ] 

Jia Fan commented on SPARK-43521:
-

I'm working for this!

> Support CREATE TABLE LIKE FILE for PARQUET
> --
>
> Key: SPARK-43521
> URL: https://issues.apache.org/jira/browse/SPARK-43521
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: melin
>Priority: Major
>
> ref: https://issues.apache.org/jira/browse/HIVE-26395



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43488) bitmap function

2023-05-17 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17723772#comment-17723772
 ] 

Jia Fan commented on SPARK-43488:
-

Hi [~cloud_fan], if we want to implement this feature, should we add a new data 
type like BitMap, which could use RoaringBitmap (or just BIGINT) as its data 
layer? Or should we just use the BIGINT data type, so that bitmapBuild(array[int]) 
returns a BIGINT? The second way is easier; the first way is more flexible, since 
we could implement a different data layer for different array sizes, just like 
`RoaringBitmap`.

I want to implement this feature, but I am unsure which plan I should choose.
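
To make the two options concrete, a rough sketch (plain Scala, not Spark internals) of what the data layer could look like if backed by the RoaringBitmap library and stored as bytes; the function names only mirror the ClickHouse ones mentioned in the ticket:

{code:java}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import org.roaringbitmap.RoaringBitmap

// Sketch only: serialize a RoaringBitmap to bytes so it could live in a
// BinaryType column (option one's flexible data layer).
def bitmapBuild(values: Array[Int]): Array[Byte] = {
  val bitmap = RoaringBitmap.bitmapOf(values: _*)
  val out = new ByteArrayOutputStream()
  bitmap.serialize(new DataOutputStream(out))
  out.toByteArray
}

def bitmapAndCardinality(left: Array[Byte], right: Array[Byte]): Long = {
  def read(bytes: Array[Byte]): RoaringBitmap = {
    val bitmap = new RoaringBitmap()
    bitmap.deserialize(new DataInputStream(new ByteArrayInputStream(bytes)))
    bitmap
  }
  RoaringBitmap.and(read(left), read(right)).getLongCardinality
}
{code}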

> bitmap function
> ---
>
> Key: SPARK-43488
> URL: https://issues.apache.org/jira/browse/SPARK-43488
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: yiku123
>Priority: Major
>
> Maybe Spark needs some bitmap functions? For example, bitmapBuild
> 、bitmapAnd、bitmapAndCardinality in clickhouse or other OLAP engine。
> These are often used in user-profiling applications, but I can't find them in Spark.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43522) Creating struct column occurs error 'org.apache.spark.sql.AnalysisException [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING]'

2023-05-16 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17723158#comment-17723158
 ] 

Jia Fan commented on SPARK-43522:
-

https://github.com/apache/spark/pull/41187

> Creating struct column occurs  error 'org.apache.spark.sql.AnalysisException 
> [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING]'
> -
>
> Key: SPARK-43522
> URL: https://issues.apache.org/jira/browse/SPARK-43522
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Heedo Lee
>Priority: Minor
>
> When creating a struct column in Dataframe, the code that ran without 
> problems in version 3.3.1 does not work in version 3.4.0.
>  
> Example
> {code:java}
> val testDF = Seq("a=b,c=d,d=f").toDF.withColumn("key_value", split('value, 
> ",")).withColumn("map_entry", transform(col("key_value"), x => 
> struct(split(x, "=").getItem(0), split(x, "=").getItem(1) ) )){code}
>  
> In 3.3.1
>  
> {code:java}
>  
> testDF.show()
> +---+---++ 
> |      value|      key_value|           map_entry| 
> +---+---++ 
> |a=b,c=d,d=f|[a=b, c=d, d=f]|[{a, b}, {c, d}, ...| 
> +---+---++
>  
> testDF.printSchema()
> root
>  |-- value: string (nullable = true)
>  |-- key_value: array (nullable = true)
>  |    |-- element: string (containsNull = false)
>  |-- map_entry: array (nullable = true)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- col1: string (nullable = true)
>  |    |    |-- col2: string (nullable = true)
> {code}
>  
>  
> In 3.4.0
>  
> {code:java}
> org.apache.spark.sql.AnalysisException: 
> [DATATYPE_MISMATCH.CREATE_NAMED_STRUCT_WITHOUT_FOLDABLE_STRING] Cannot 
> resolve "struct(split(namedlambdavariable(), =, -1)[0], 
> split(namedlambdavariable(), =, -1)[1])" due to data type mismatch: Only 
> foldable `STRING` expressions are allowed to appear at odd position, but they 
> are ["0", "1"].;
> 'Project [value#41, key_value#45, transform(key_value#45, 
> lambdafunction(struct(0, split(lambda x_3#49, =, -1)[0], 1, split(lambda 
> x_3#49, =, -1)[1]), lambda x_3#49, false)) AS map_entry#48]
> +- Project [value#41, split(value#41, ,, -1) AS key_value#45]
>    +- LocalRelation [value#41]  at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:269)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:256)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:294)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
> 
>  
> {code}
>  
> However, if you add an alias to the struct elements, you get the same result 
> as in the previous version.
>  
> {code:java}
> val testDF = Seq("a=b,c=d,d=f").toDF.withColumn("key_value", split('value, 
> ",")).withColumn("map_entry", transform(col("key_value"), x => 
> struct(split(x, "=").getItem(0).as("col1") , split(x, 
> "=").getItem(1).as("col2") ) )){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40189) Support json_array_get function

2023-05-13 Thread Jia Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jia Fan updated SPARK-40189:

Description: 
presto provides json_array_get function, frequently used

[https://prestodb.io/docs/current/functions/json.html#json-functions]

  was:
presto provides these two functions,frequently used

https://prestodb.io/docs/current/functions/json.html#json-functions


> Support json_array_get function
> ---
>
> Key: SPARK-40189
> URL: https://issues.apache.org/jira/browse/SPARK-40189
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: melin
>Priority: Major
>
> presto provides json_array_get function, frequently used
> [https://prestodb.io/docs/current/functions/json.html#json-functions]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40189) Support json_array_get function

2023-05-13 Thread Jia Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jia Fan updated SPARK-40189:

Summary: Support json_array_get function  (was: Support 
json_array_get/json_array_length function)

> Support json_array_get function
> ---
>
> Key: SPARK-40189
> URL: https://issues.apache.org/jira/browse/SPARK-40189
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: melin
>Priority: Major
>
> presto provides these two functions, frequently used
> https://prestodb.io/docs/current/functions/json.html#json-functions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40189) Support json_array_get/json_array_length function

2023-05-13 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17722443#comment-17722443
 ] 

Jia Fan commented on SPARK-40189:
-

Already have json_array_length function in SPARK-31008.
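
For the json_array_get half, a rough approximation is already possible by combining from_json and element_at; a small sketch (this parses the whole array rather than extracting a single element, so it is not a drop-in replacement):

{code:java}
// Approximate Presto's json_array_get(json, index) with existing Spark functions.
// element_at is 1-based, so index 2 fetches the second element.
spark.sql("""SELECT element_at(from_json('["a","b","c"]', 'array<string>'), 2) AS second""").show()
// second = b
{code}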

> Support json_array_get/json_array_length function
> -
>
> Key: SPARK-40189
> URL: https://issues.apache.org/jira/browse/SPARK-40189
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: melin
>Priority: Major
>
> presto provides these two functions, frequently used
> https://prestodb.io/docs/current/functions/json.html#json-functions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-43492) Define the DATE_ADD and DATE_DIFF functions with 3-args

2023-05-13 Thread Jia Fan (Jira)


[ https://issues.apache.org/jira/browse/SPARK-43492 ]


Jia Fan deleted comment on SPARK-43492:
-

was (Author: fanjia):
I can fix this.

> Define the DATE_ADD and DATE_DIFF functions with 3-args
> ---
>
> Key: SPARK-43492
> URL: https://issues.apache.org/jira/browse/SPARK-43492
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Spark supports the DATE_ADD and DATE_DIFF functions with 2 arguments but when 
> an user calls the same functions with 3 arguments, Spark SQL outputs the 
> confusing error:
> {code:sql}
> spark-sql (default)> select date_add(MONTH, 1, date'2023-05-13');
> [UNRESOLVED_COLUMN.WITHOUT_SUGGESTION] A column or function parameter with 
> name `MONTH` cannot be resolved. ; line 1 pos 16;
> 'Project [unresolvedalias('date_add('MONTH, 1, 2023-05-13), None)]
> +- OneRowRelation
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43492) Define the DATE_ADD and DATE_DIFF functions with 3-args

2023-05-13 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17722439#comment-17722439
 ] 

Jia Fan commented on SPARK-43492:
-

I can fix this.

> Define the DATE_ADD and DATE_DIFF functions with 3-args
> ---
>
> Key: SPARK-43492
> URL: https://issues.apache.org/jira/browse/SPARK-43492
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Spark supports the DATE_ADD and DATE_DIFF functions with 2 arguments but when 
> an user calls the same functions with 3 arguments, Spark SQL outputs the 
> confusing error:
> {code:sql}
> spark-sql (default)> select date_add(MONTH, 1, date'2023-05-13');
> [UNRESOLVED_COLUMN.WITHOUT_SUGGESTION] A column or function parameter with 
> name `MONTH` cannot be resolved. ; line 1 pos 16;
> 'Project [unresolvedalias('date_add('MONTH, 1, 2023-05-13), None)]
> +- OneRowRelation
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41401) spark3 stagedir can't be change

2023-05-13 Thread Jia Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jia Fan updated SPARK-41401:

Summary: spark3 stagedir can't be change   (was: spark2 stagedir can't be 
change )

> spark3 stagedir can't be change 
> 
>
> Key: SPARK-41401
> URL: https://issues.apache.org/jira/browse/SPARK-41401
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.2, 3.2.3
>Reporter: sinlang
>Priority: Major
>
> I want to use a different staging dir when writing temporary data, but Spark 3 
> seems to only be able to write under the table path.
> The spark.yarn.stagingDir parameter only works with Spark 2.
>  
> In the org.apache.spark.internal.io.FileCommitProtocol file:
>   def getStagingDir(path: String, jobId: String): Path = {
>     new Path(path, ".spark-staging-" + jobId)
>   }



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40129) Decimal multiply can produce the wrong answer because it rounds twice

2023-05-12 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1774#comment-1774
 ] 

Jia Fan commented on SPARK-40129:
-

https://github.com/apache/spark/pull/41156

> Decimal multiply can produce the wrong answer because it rounds twice
> -
>
> Key: SPARK-40129
> URL: https://issues.apache.org/jira/browse/SPARK-40129
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: Robert Joseph Evans
>Priority: Major
>
> This looks like it has been around for a long time, but I have reproduced it 
> in 3.2.0+
> The example here is multiplying Decimal(38, 10) by another Decimal(38, 10), 
> but I think it can be reproduced with other number combinations, and possibly 
> with divide too.
> {code:java}
> Seq("9173594185998001607642838421.5479932913").toDF.selectExpr("CAST(value as 
> DECIMAL(38,10)) as a").selectExpr("a * CAST(-12 as 
> DECIMAL(38,10))").show(truncate=false)
> {code}
> This produces an answer in Spark of 
> {{-110083130231976019291714061058.575920}} But if I do the calculation in 
> regular java BigDecimal I get {{-110083130231976019291714061058.575919}}
> {code:java}
> BigDecimal l = new BigDecimal("9173594185998001607642838421.5479932913");
> BigDecimal r = new BigDecimal("-12.00");
> BigDecimal prod = l.multiply(r);
> BigDecimal rounded_prod = prod.setScale(6, RoundingMode.HALF_UP);
> {code}
> Spark does essentially all of the same operations, but it used Decimal to do 
> it instead of java's BigDecimal directly. Spark, by way of Decimal, will set 
> a MathContext for the multiply operation that has a max precision of 38 and 
> will do half up rounding. That means that the result of the multiply 
> operation in Spark is {{{}-110083130231976019291714061058.57591950{}}}, but 
> for the java BigDecimal code the result is 
> {{{}-110083130231976019291714061058.575919495600{}}}. Then in 
> CheckOverflow for 3.2.0 and 3.3.0 or in just the regular Multiply expression 
> in 3.4.0 the setScale is called (as a part of Decimal.setPrecision). At that 
> point the already rounded number is rounded yet again resulting in what is 
> arguably a wrong answer by Spark.
> I have not fully tested this, but it looks like we could just remove the 
> MathContext entirely in Decimal, or set it to UNLIMITED. All of the decimal 
> operations appear to have their own overflow and rounding anyways. If we want 
> to potentially reduce the total memory usage, we could also set the max 
> precision to 39 and truncate (round down) the result in the math context 
> instead.  That would then let us round the result correctly in setPrecision 
> afterwards.
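
A standalone sketch of the double rounding described above, using only java.math.BigDecimal; the MathContext mirrors the one Decimal applies (precision 38, HALF_UP):

{code:java}
import java.math.{BigDecimal => JBigDecimal, MathContext, RoundingMode}

val l = new JBigDecimal("9173594185998001607642838421.5479932913")
val r = new JBigDecimal("-12.00")

// One rounding step: exact product, then setScale(6) => ...575919
val roundedOnce = l.multiply(r).setScale(6, RoundingMode.HALF_UP)

// Two rounding steps: multiply under a precision-38 HALF_UP context first
// (as Decimal does), then setScale(6) again in setPrecision => ...575920
val roundedTwice = l.multiply(r, new MathContext(38, RoundingMode.HALF_UP))
  .setScale(6, RoundingMode.HALF_UP)

println(roundedOnce)   // -110083130231976019291714061058.575919
println(roundedTwice)  // -110083130231976019291714061058.575920
{code}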



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43267) Support creating data frame from a Postgres table that contains user-defined array column

2023-05-12 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17722029#comment-17722029
 ] 

Jia Fan commented on SPARK-43267:
-

https://github.com/apache/spark/pull/40953

> Support creating data frame from a Postgres table that contains user-defined 
> array column
> -
>
> Key: SPARK-43267
> URL: https://issues.apache.org/jira/browse/SPARK-43267
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0, 3.3.2
>Reporter: Sifan Huang
>Priority: Blocker
>
> Spark SQL now doesn’t support creating data frame from a Postgres table that 
> contains user-defined array column. However, it used to allow such type 
> before the Postgres JDBC commit 
> (https://github.com/pgjdbc/pgjdbc/commit/375cb3795c3330f9434cee9353f0791b86125914).
>  The previous behavior was to handle user-defined array column as String.
> Given:
>  * Postgres table with user-defined array column
>  * Function: DataFrameReader.jdbc - 
> https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/sql/DataFrameReader.html#jdbc-java.lang.String-java.lang.String-java.util.Properties-
> Results:
>  * Exception “java.sql.SQLException: Unsupported type ARRAY” is thrown
> Expectation after the change:
>  * Function call succeeds
>  * User-defined array is converted as a string in Spark DataFrame
> Suggested fix:
>  * Update “getCatalystType” function in “PostgresDialect” as
>  ** 
> {code:java}
> val catalystType = toCatalystType(typeName.drop(1), size, 
> scale).map(ArrayType(_))
> if (catalystType.isEmpty) Some(StringType) else catalystType{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39420) Support ANALYZE TABLE on v2 tables

2023-05-12 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17722027#comment-17722027
 ] 

Jia Fan commented on SPARK-39420:
-

https://github.com/apache/spark/pull/4

> Support ANALYZE TABLE on v2 tables
> --
>
> Key: SPARK-39420
> URL: https://issues.apache.org/jira/browse/SPARK-39420
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Felipe
>Priority: Major
>
> According to https://github.com/delta-io/delta/pull/840 to implement ANALYZE 
> TABLE in Delta, we need to add the missing APIs in Spark to allow a data 
> source to report the file set to calculate the stats.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43443) Add benchmark for Timestamp type inference when use invalid value

2023-05-10 Thread Jia Fan (Jira)
Jia Fan created SPARK-43443:
---

 Summary: Add benchmark for Timestamp type inference when use 
invalid value
 Key: SPARK-43443
 URL: https://issues.apache.org/jira/browse/SPARK-43443
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Jia Fan


We need a benchmark to measure whether our optimization of Timestamp type 
inference is useful. We currently have a benchmark for valid Timestamp values, but 
we don't have one for invalid Timestamp values when Timestamp type inference is used.
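
A rough sketch of the case such a benchmark should exercise; assumptions: local mode, a synthetic Dataset of strings that resemble timestamps but never parse, and timing done by hand rather than through the existing benchmark harness:

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("invalid-ts-inference").getOrCreate()
import spark.implicits._

// Values that look like timestamps but can never parse, forcing schema
// inference down its failure path for every row.
val invalid = Seq.fill(100000)("2023-02-30X12:61:61").toDS()

val start = System.nanoTime()
val schema = spark.read.option("inferSchema", "true").csv(invalid).schema
println(s"inferred $schema in ${(System.nanoTime() - start) / 1e6} ms")
{code}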



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


