[jira] [Created] (SPARK-46405) Issue with CSV schema inference and malformed records

2023-12-14 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-46405:
---

 Summary: Issue with CSV schema inference and malformed records
 Key: SPARK-46405
 URL: https://issues.apache.org/jira/browse/SPARK-46405
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Yaohua Zhao


There appears to be a discrepancy between schema inference in the CSV reader 
and the JSON reader. When processing CSV files without a predefined schema, 
malformed records are handled inconsistently: unlike the JSON format, where a 
`_corrupt_record` column is automatically added in the presence of malformed 
records, the CSV format does not exhibit this behavior. This inconsistency can 
lead to unexpected results and data loss during processing.

*Steps to Reproduce:*
 # Read a CSV file containing malformed records without providing a schema.
 # Observe that the `_corrupt_record` column is not automatically added to the 
final dataframe (see the repro sketch below).

*Expected Result:* The `_corrupt_record` column should be automatically added 
to the final dataframe when processing a CSV file with malformed records, 
similar to the behavior observed with JSON files.
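A minimal repro sketch, assuming a Spark shell and scratch paths under `/tmp` 
(the paths and sample data are illustrative):
{code:java}
import spark.implicits._

// A CSV file whose last row has too many columns (hypothetical path).
Seq("a,b", "1,2", "3,4,5").toDF().coalesce(1)
  .write.mode("overwrite").text("/tmp/malformed_csv")

// Without a user-provided schema, CSV inference drops the extra field and
// never surfaces a `_corrupt_record` column...
spark.read.option("header", "true").csv("/tmp/malformed_csv").printSchema()
// root: only `a` and `b`; no `_corrupt_record`

// ...whereas JSON inference adds `_corrupt_record` for malformed input.
Seq("""{"a": 1}""", """{"a": """).toDF().coalesce(1)
  .write.mode("overwrite").text("/tmp/malformed_json")
spark.read.json("/tmp/malformed_json").printSchema()
// root: `a` and `_corrupt_record`
{code}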






[jira] [Updated] (SPARK-45815) Provide an interface for Streaming sources to add _metadata columns

2023-11-06 Thread Yaohua Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaohua Zhao updated SPARK-45815:

Component/s: Structured Streaming

> Provide an interface for Streaming sources to add _metadata columns
> ---
>
> Key: SPARK-45815
> URL: https://issues.apache.org/jira/browse/SPARK-45815
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.5.1
>Reporter: Yaohua Zhao
>Priority: Major
>
> Currently, only the native V1 file-based streaming source can read the 
> `_metadata` column: 
> [https://github.com/apache/spark/blob/370870b7a0303e4a2c4b3dea1b479b4fcbc93f8d/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala#L63]
>  
> Our goal is to create an interface that allows other streaming sources to add 
> `_metadata` columns. For instance, we would like the Delta Streaming 
> source, which you can find here: 
> [https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/sources/DeltaDataSource.scala#L49],
>  to extend this interface and provide the `_metadata` column for its 
> underlying storage format, such as Parquet.
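A rough sketch of the kind of interface this proposes; the trait and method 
names below are hypothetical, not an actual Spark API:
{code:java}
import org.apache.spark.sql.catalyst.expressions.Attribute

// Hypothetical shape: a streaming source mixes this in to declare the
// metadata columns (e.g. `_metadata`) it can expose.
trait SupportsStreamingMetadataColumns {
  def metadataOutput: Seq[Attribute]
}
{code}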






[jira] [Updated] (SPARK-45815) Provide an interface for Streaming sources to add _metadata columns

2023-11-06 Thread Yaohua Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaohua Zhao updated SPARK-45815:

Description: 
Currently, only the native V1 file-based streaming source can read the 
`_metadata` column: 
[https://github.com/apache/spark/blob/370870b7a0303e4a2c4b3dea1b479b4fcbc93f8d/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala#L63]

 

Our goal is to create an interface that allows other streaming sources to add 
`_metadata` columns. For instance, we would like the Delta Streaming 
source, which you can find here: 
[https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/sources/DeltaDataSource.scala#L49],
 to extend this interface and provide the `_metadata` column for its 
underlying storage format, such as Parquet.

  was:
Currently, only the native V1 file-based streaming source can read the 
`_metadata` column: 
https://github.com/apache/spark/blob/370870b7a0303e4a2c4b3dea1b479b4fcbc93f8d/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala#L63

 

Our goal is to create an interface that allows other streaming sources to add 
`_metadata` columns. For instance, we would like the Delta Streaming 
source, which you can find here: 
[https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/sources/DeltaDataSource.scala#L49],
 to extend this interface and provide the `_metadata` column for its 
underlying storage format, such as Parquet.


> Provide an interface for Streaming sources to add _metadata columns
> ---
>
> Key: SPARK-45815
> URL: https://issues.apache.org/jira/browse/SPARK-45815
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Yaohua Zhao
>Priority: Major
>
> Currently, only the native V1 file-based streaming source can read the 
> `_metadata` column: 
> [https://github.com/apache/spark/blob/370870b7a0303e4a2c4b3dea1b479b4fcbc93f8d/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala#L63]
>  
> Our goal is to create an interface that allows other streaming sources to add 
> `_metadata` columns. For instance, we would like the Delta Streaming 
> source, which you can find here: 
> [https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/sources/DeltaDataSource.scala#L49],
>  to extend this interface and provide the `_metadata` column for its 
> underlying storage format, such as Parquet.






[jira] [Created] (SPARK-45815) Provide an interface for Streaming sources to add _metadata columns

2023-11-06 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-45815:
---

 Summary: Provide an interface for Streaming sources to add 
_metadata columns
 Key: SPARK-45815
 URL: https://issues.apache.org/jira/browse/SPARK-45815
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.1
Reporter: Yaohua Zhao


Currently, only the native V1 file-based streaming source can read the 
`_metadata` column: 
https://github.com/apache/spark/blob/370870b7a0303e4a2c4b3dea1b479b4fcbc93f8d/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala#L63

 

Our goal is to create an interface that allows other streaming sources to add 
`_metadata` columns. For instance, we would like the Delta Streaming 
source, which you can find here: 
[https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/sources/DeltaDataSource.scala#L49],
 to extend this interface and provide the `_metadata` column for its 
underlying storage format, such as Parquet.






[jira] [Created] (SPARK-45035) Support ignoreCorruptFiles for multiline CSV

2023-08-31 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-45035:
---

 Summary: Support ignoreCorruptFiles for multiline CSV
 Key: SPARK-45035
 URL: https://issues.apache.org/jira/browse/SPARK-45035
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Yaohua Zhao


Today, `ignoreCorruptFiles` does not work well for multiline CSV mode.
{code:java}
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")val testCorruptDF0 
= spark.read.option("ignoreCorruptFiles", "true").option("multiline", 
"true").csv("/tmp/sourcepath/").show() {code}
It throws an exception instead of silently ignoring the corrupt files:
{code:java}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 4940.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4940.0 
(TID 4031) (10.68.177.106 executor 0): 
com.univocity.parsers.common.TextParsingException: 
java.lang.IllegalStateException - Error reading from input
Parser Configuration: CsvParserSettings:
Auto configuration enabled=true
Auto-closing enabled=true
Autodetect column delimiter=false
Autodetect quotes=false
Column reordering enabled=true
Delimiters for detection=null
Empty value=
Escape unquoted values=false
Header extraction enabled=null
Headers=null
Ignore leading whitespaces=false
Ignore leading whitespaces in quotes=false
Ignore trailing whitespaces=false
Ignore trailing whitespaces in quotes=false
Input buffer size=1048576
Input reading on separate thread=false
Keep escape sequences=false
Keep quotes=false
Length of content displayed on error=1000
Line separator detection enabled=true
Maximum number of characters per column=-1
Maximum number of columns=20480
Normalize escaped line separators=true
Null value=
Number of records to read=all
Processor=none
Restricting data in exceptions=false
RowProcessor error handler=null
Selected fields=none
Skip bits as whitespace=true
Skip empty lines=true
Unescaped quote handling=STOP_AT_DELIMITER
Format configuration:
CsvFormat:
Comment character=#
Field delimiter=,
Line separator (normalized)=\n
Line separator sequence=\n
Quote character="
Quote escape character=\
Quote escape escape character=null
Internal state when error was thrown: line=0, column=0, record=0
at 
com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:402)
at 
com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:277)
at 
com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:843)
at 
org.apache.spark.sql.catalyst.csv.UnivocityParser$$anon$1.<init>(UnivocityParser.scala:463)
at 
org.apache.spark.sql.catalyst.csv.UnivocityParser$.convertStream(UnivocityParser.scala:46...
 {code}
This is because multiline parsing uses a different RDD (`BinaryFileRDD`) that 
does not go through `FileScanRDD`. We could potentially add this support to 
`BinaryFileRDD`, or even reuse `FileScanRDD` for the multiline parsing mode.
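For reference, a simplified sketch of the kind of per-file handling 
`FileScanRDD` applies (illustrative only, not the actual Spark source):
{code:java}
import java.io.IOException

// Skip a corrupt file instead of failing the task, mirroring what
// `spark.sql.files.ignoreCorruptFiles` does for non-multiline reads.
def readWithIgnore[T](path: String, ignoreCorruptFiles: Boolean)
    (read: String => Iterator[T]): Iterator[T] = {
  try read(path)
  catch {
    case _: IOException if ignoreCorruptFiles => Iterator.empty
  }
}
{code}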






[jira] [Created] (SPARK-43177) Add deprecation warning for input_file_name()

2023-04-18 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-43177:
---

 Summary: Add deprecation warning for input_file_name()
 Key: SPARK-43177
 URL: https://issues.apache.org/jira/browse/SPARK-43177
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Yaohua Zhao


With the new `_metadata` column, users shouldn't need `input_file_name()` 
anymore. We should add a deprecation warning and update the docs.
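For context, a sketch of the migration (the path is hypothetical; 
`_metadata.file_path` is part of the documented hidden metadata struct):
{code:java}
import org.apache.spark.sql.functions._

val df = spark.read.parquet("/tmp/data")

// Deprecated pattern:
df.select(input_file_name()).show()

// Preferred replacement via the hidden `_metadata` column:
df.select(col("_metadata.file_path")).show()
{code}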






[jira] [Created] (SPARK-41151) Keep built-in file _metadata column nullable value consistent

2022-11-15 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-41151:
---

 Summary: Keep built-in file _metadata column nullable value 
consistent
 Key: SPARK-41151
 URL: https://issues.apache.org/jira/browse/SPARK-41151
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.1, 3.3.0, 3.3.2
Reporter: Yaohua Zhao


In FileSourceStrategy, we add an Alias node to wrap the file metadata fields 
(e.g. file_name, file_size) in a NamedStruct 
([here|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L279]).
 But `CreateNamedStruct` overrides `nullable` to `false` 
([here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L443]),
 which differs from the `_metadata` struct's `nullable` value of `true` 
([here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala#L467]).

We should keep the nullable values consistent; otherwise, downstream 
optimization rules might rely on the nullability and cause unexpected 
behavior.
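A minimal spark-shell illustration of the mismatch (the wrapped field here is 
hypothetical):
{code:java}
import org.apache.spark.sql.catalyst.expressions.{Alias, CreateNamedStruct, Literal}

// Wrap a single metadata-like field the way FileSourceStrategy does.
val struct = CreateNamedStruct(Seq(Literal("file_size"), Literal(0L)))
val wrapped = Alias(struct, "_metadata")()

wrapped.nullable // false: inherited from CreateNamedStruct
// ...while the `_metadata` attribute itself is declared with nullable = true.
{code}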






[jira] [Updated] (SPARK-41143) Add named arguments function syntax support and trait

2022-11-14 Thread Yaohua Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaohua Zhao updated SPARK-41143:

Description: Parser can parse: _FUNC_ ( key0 => value0 )  (was: 
Parser can parse: _FUNC_ ( key0 => value0 ))

> Add named arguments function syntax support and trait
> -
>
> Key: SPARK-41143
> URL: https://issues.apache.org/jira/browse/SPARK-41143
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: Yaohua Zhao
>Priority: Major
>
> Parser can parse: _FUNC_ ( key0 => value0 )






[jira] [Updated] (SPARK-41143) Add named arguments function syntax support and trait

2022-11-14 Thread Yaohua Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaohua Zhao updated SPARK-41143:

Description: 
The parser can parse:
{code:java}
_FUNC_ ( key0 => value0 ){code}

  was: Parser can parse: _FUNC_ ( key0 => value0 )


> Add named arguments function syntax support and trait
> -
>
> Key: SPARK-41143
> URL: https://issues.apache.org/jira/browse/SPARK-41143
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: Yaohua Zhao
>Priority: Major
>
> The parser can parse:
> {code:java}
> _FUNC_ ( key0 => value0 ){code}






[jira] [Updated] (SPARK-41143) Add named arguments function syntax support and trait

2022-11-14 Thread Yaohua Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaohua Zhao updated SPARK-41143:

Description: Parser can parse: _FUNC_ ( key0 => value0 )  (was: 
Parser can parse: _FUNC_ ( key0 => value0 ))

> Add named arguments function syntax support and trait
> -
>
> Key: SPARK-41143
> URL: https://issues.apache.org/jira/browse/SPARK-41143
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: Yaohua Zhao
>Priority: Major
>
> Parser can parse: _FUNC_ ( key0 => value0 )






[jira] [Created] (SPARK-41143) Add named arguments function syntax support and trait

2022-11-14 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-41143:
---

 Summary: Add named arguments function syntax support and trait
 Key: SPARK-41143
 URL: https://issues.apache.org/jira/browse/SPARK-41143
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.2
Reporter: Yaohua Zhao


Parser can parse: _FUNC_ ( key0 => value0 )






[jira] [Created] (SPARK-41142) Support named arguments functions

2022-11-14 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-41142:
---

 Summary: Support named arguments functions
 Key: SPARK-41142
 URL: https://issues.apache.org/jira/browse/SPARK-41142
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.3.2
Reporter: Yaohua Zhao


Support named-argument function calls in Spark SQL.

General usage: _FUNC_(arg0, arg1, arg2, arg5 => value5, arg8 => value8)
 * Arguments can be passed positionally or by name (see the sketch below).
 * Positional arguments cannot come after a named argument.
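A sketch of the call forms (the function and parameter names are placeholders):
{code:java}
// Positional arguments first, then named arguments in any order:
spark.sql("SELECT my_func('a', 'b', opt2 => 42, opt1 => true)")

// Invalid: a positional argument may not follow a named one.
// spark.sql("SELECT my_func('a', opt1 => true, 'b')")
{code}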






[jira] [Commented] (SPARK-40460) Streaming metrics is zero when select _metadata

2022-09-19 Thread Yaohua Zhao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606678#comment-17606678
 ] 

Yaohua Zhao commented on SPARK-40460:
-

[~kabhwan] You are right! Updated.

> Streaming metrics is zero when select _metadata
> ---
>
> Key: SPARK-40460
> URL: https://issues.apache.org/jira/browse/SPARK-40460
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: Yaohua Zhao
>Assignee: Yaohua Zhao
>Priority: Major
> Fix For: 3.4.0
>
>
> Streaming metrics report all 0 (`processedRowsPerSecond`, etc.) when 
> selecting the `_metadata` column, because the logical plan from the batch 
> and the actual planned logical plan are mismatched: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348]






[jira] [Updated] (SPARK-40460) Streaming metrics is zero when select _metadata

2022-09-19 Thread Yaohua Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaohua Zhao updated SPARK-40460:

Affects Version/s: 3.4.0

> Streaming metrics is zero when select _metadata
> ---
>
> Key: SPARK-40460
> URL: https://issues.apache.org/jira/browse/SPARK-40460
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.3.0, 3.4.0, 3.3.1, 3.3.2
>Reporter: Yaohua Zhao
>Assignee: Yaohua Zhao
>Priority: Major
> Fix For: 3.4.0
>
>
> Streaming metrics report all 0 (`processedRowsPerSecond`, etc.) when 
> selecting the `_metadata` column, because the logical plan from the batch 
> and the actual planned logical plan are mismatched: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348]






[jira] [Updated] (SPARK-40460) Streaming metrics is zero when select _metadata

2022-09-19 Thread Yaohua Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaohua Zhao updated SPARK-40460:

Affects Version/s: 3.3.1
   3.3.2
   (was: 3.2.0)
   (was: 3.2.1)
   (was: 3.2.2)

> Streaming metrics is zero when select _metadata
> ---
>
> Key: SPARK-40460
> URL: https://issues.apache.org/jira/browse/SPARK-40460
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.3.0, 3.3.1, 3.3.2
>Reporter: Yaohua Zhao
>Assignee: Yaohua Zhao
>Priority: Major
> Fix For: 3.4.0
>
>
> Streaming metrics report all 0 (`processedRowsPerSecond`, etc.) when 
> selecting the `_metadata` column, because the logical plan from the batch 
> and the actual planned logical plan are mismatched: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348]






[jira] [Updated] (SPARK-40460) Streaming metrics is zero when select _metadata

2022-09-15 Thread Yaohua Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaohua Zhao updated SPARK-40460:

Description: Streaming metrics report all 0 (`processedRowsPerSecond`, etc.) 
when selecting the `_metadata` column, because the logical plan from the batch 
and the actual planned logical plan are mismatched: 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348]
  (was: Streaming metrics report all 0 (`processedRowsPerSecond`, etc) when 
selecting `_metadata` column. Because the logical plan from the streaming 
relation and the actual executed logical plan are mismatched: 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348])

> Streaming metrics is zero when select _metadata
> ---
>
> Key: SPARK-40460
> URL: https://issues.apache.org/jira/browse/SPARK-40460
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: Yaohua Zhao
>Priority: Major
>
> Streaming metrics report all 0 (`processedRowsPerSecond`, etc.) when 
> selecting the `_metadata` column, because the logical plan from the batch 
> and the actual planned logical plan are mismatched: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348]






[jira] [Created] (SPARK-40460) Streaming metrics is zero when select _metadata

2022-09-15 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-40460:
---

 Summary: Streaming metrics is zero when select _metadata
 Key: SPARK-40460
 URL: https://issues.apache.org/jira/browse/SPARK-40460
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.2.2, 3.3.0, 3.2.1, 3.2.0
Reporter: Yaohua Zhao


Streaming metrics report all 0 (`processedRowsPerSecond`, etc.) when selecting 
the `_metadata` column, because the logical plan from the streaming relation 
and the actual executed logical plan are mismatched: 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348]
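A sketch of a query exhibiting the zeroed metrics (the schema and paths are 
hypothetical):
{code:java}
import org.apache.spark.sql.types._

val schema = new StructType().add("id", LongType)

val q = spark.readStream.schema(schema).csv("/tmp/in")
  .select("_metadata", "*")
  .writeStream.format("console").start()

// Despite rows flowing, progress events report e.g.
// q.lastProgress.processedRowsPerSecond == 0.0
{code}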






[jira] [Updated] (SPARK-39768) Strip any CRLF character if lineSep is not set in CSV data source

2022-07-13 Thread Yaohua Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaohua Zhao updated SPARK-39768:

Summary: Strip any CRLF character if lineSep is not set in CSV data source  
(was: Strip any CLRF character if lineSep is not set in CSV data source)

> Strip any CRLF character if lineSep is not set in CSV data source
> -
>
> Key: SPARK-39768
> URL: https://issues.apache.org/jira/browse/SPARK-39768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yaohua Zhao
>Priority: Minor
>
> If `lineSep` is not set, the line separator is automatically detected. To be 
> safe, we should strip any trailing CRLF characters from the column names.






[jira] [Updated] (SPARK-39768) Strip any CRLF character if lineSep is not set in CSV data source

2022-07-13 Thread Yaohua Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaohua Zhao updated SPARK-39768:

Description: If `lineSep` is not set, the line separator is automatically 
detected. To be safe, we should strip any trailing _CRLF_ characters from the 
column names.  (was: If `lineSep` is not set, the line separator is 
automatically detected. To be safe, we should strip any CLRF character at the 
suffix in the column names.)

> Strip any CRLF character if lineSep is not set in CSV data source
> -
>
> Key: SPARK-39768
> URL: https://issues.apache.org/jira/browse/SPARK-39768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yaohua Zhao
>Priority: Minor
>
> If `lineSep` is not set, the line separator is automatically detected. To be 
> safe, we should strip any trailing _CRLF_ characters from the column names.






[jira] [Commented] (SPARK-39768) Strip any CLRF character if lineSep is not set in CSV data source

2022-07-13 Thread Yaohua Zhao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566417#comment-17566417
 ] 

Yaohua Zhao commented on SPARK-39768:
-

cc [~hyukjin.kwon]

> Strip any CLRF character if lineSep is not set in CSV data source
> -
>
> Key: SPARK-39768
> URL: https://issues.apache.org/jira/browse/SPARK-39768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yaohua Zhao
>Priority: Minor
>
> If `lineSep` is not set, the line separator is automatically detected. To be 
> safe, we should strip any trailing CRLF characters from the column names.






[jira] [Created] (SPARK-39768) Strip any CLRF character if lineSep is not set in CSV data source

2022-07-13 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-39768:
---

 Summary: Strip any CLRF character if lineSep is not set in CSV 
data source
 Key: SPARK-39768
 URL: https://issues.apache.org/jira/browse/SPARK-39768
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Yaohua Zhao


If `lineSep` is not set, the line separator is automatically detected. To be 
safe, we should strip any trailing CRLF characters from the column names.
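A sketch of the failure mode (hypothetical path; separator detection details 
may vary):
{code:java}
import spark.implicits._

// Each line carries a trailing \r; the text writer appends \n, so the file
// ends up with Windows-style \r\n line endings.
Seq("a,b\r", "1,2\r").toDF().coalesce(1)
  .write.mode("overwrite").text("/tmp/crlf_csv")

// If \n alone is detected as the separator, the carriage return can stay
// attached to the last column name:
spark.read.option("header", "true").csv("/tmp/crlf_csv").columns
// -> could yield Array("a", "b\r") without the stripping
{code}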






[jira] [Created] (SPARK-39689) Support 2-chars lineSep in CSV datasource

2022-07-05 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-39689:
---

 Summary: Support 2-chars lineSep in CSV datasource
 Key: SPARK-39689
 URL: https://issues.apache.org/jira/browse/SPARK-39689
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Yaohua Zhao


The Univocity parser allows setting the line separator to 1 or 2 characters 
([code|https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/common/Format.java#L103]);
 the CSV options should not block this usage 
([code|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala#L218]).

 

Due to the limitation around `normalizedNewLine` 
(https://github.com/uniVocity/univocity-parsers/issues/170), setting 2 chars as 
the line separator could cause some weird/bad behaviors. Thus, we should 
probably leave this proposed fix as an undocumented feature and warn users 
about doing it.

 

A more proper fix could be further investigated in the future.
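For illustration, the usage this would unblock (hypothetical path):
{code:java}
// A two-character line separator, e.g. Windows-style CRLF:
spark.read
  .option("lineSep", "\r\n")
  .option("header", "true")
  .csv("/tmp/data")
{code}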






[jira] [Created] (SPARK-39404) Unable to query _metadata in streaming if getBatch returns multiple logical nodes in the DataFrame

2022-06-07 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-39404:
---

 Summary: Unable to query _metadata in streaming if getBatch 
returns multiple logical nodes in the DataFrame
 Key: SPARK-39404
 URL: https://issues.apache.org/jira/browse/SPARK-39404
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.2.1
Reporter: Yaohua Zhao


Here: 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L585]

 

We should probably use `transform` instead of `match`.
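A sketch of the idea (simplified and hypothetical; `replaceWithBatch` stands 
in for the actual rewrite logic):
{code:java}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

def replaceWithBatch(plan: LogicalPlan): LogicalPlan = plan // placeholder

// `transform` rewrites matching nodes anywhere in the tree, so a getBatch
// DataFrame made of multiple logical nodes is still handled; a top-level
// `match` only sees the root node.
def rewrite(newBatchPlan: LogicalPlan): LogicalPlan =
  newBatchPlan transform {
    case p if p.isStreaming => replaceWithBatch(p)
  }
{code}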






[jira] [Created] (SPARK-39014) Respect ignoreMissingFiles from Data Source options in InMemoryFileIndex

2022-04-25 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-39014:
---

 Summary: Respect ignoreMissingFiles from Data Source options in 
InMemoryFileIndex
 Key: SPARK-39014
 URL: https://issues.apache.org/jira/browse/SPARK-39014
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.1
Reporter: Yaohua Zhao









[jira] [Created] (SPARK-38767) Support ignoreCorruptFiles and ignoreMissingFiles in Data Source options

2022-04-01 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-38767:
---

 Summary: Support ignoreCorruptFiles and ignoreMissingFiles in Data 
Source options
 Key: SPARK-38767
 URL: https://issues.apache.org/jira/browse/SPARK-38767
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.1
Reporter: Yaohua Zhao









[jira] [Updated] (SPARK-38323) Support the hidden file metadata in Streaming

2022-02-24 Thread Yaohua Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaohua Zhao updated SPARK-38323:

Description: 
Currently, querying the hidden file metadata struct `_metadata` fails with the 
`readStream` and `writeStream` APIs.
{code:java}
spark
  .readStream
  ...
  .select("_metadata")
  .writeStream
  ...
  .start(){code}
We need to expose the metadata output to `StreamingRelation` as well.

> Support the hidden file metadata in Streaming
> -
>
> Key: SPARK-38323
> URL: https://issues.apache.org/jira/browse/SPARK-38323
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.2.1
>Reporter: Yaohua Zhao
>Priority: Major
>
> Currently, querying the hidden file metadata struct `_metadata` fails 
> with the `readStream` and `writeStream` APIs.
> {code:java}
> spark
>   .readStream
>   ...
>   .select("_metadata")
>   .writeStream
>   ...
>   .start(){code}
> We need to expose the metadata output to `StreamingRelation` as well.






[jira] [Created] (SPARK-38323) Support the hidden file metadata in Streaming

2022-02-24 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-38323:
---

 Summary: Support the hidden file metadata in Streaming
 Key: SPARK-38323
 URL: https://issues.apache.org/jira/browse/SPARK-38323
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Structured Streaming
Affects Versions: 3.2.1
Reporter: Yaohua Zhao









[jira] [Updated] (SPARK-38314) Fail to read parquet files after writing the hidden file metadata in

2022-02-24 Thread Yaohua Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaohua Zhao updated SPARK-38314:

Description: 
Selecting and then writing a df containing the hidden file metadata column 
`_metadata` into a file format like `parquet` or `delta` will still keep the 
internal `Attribute` metadata information. Then, when reading those `parquet` 
or `delta` files again, it breaks, because Spark wrongly treats a user data 
column named `_metadata` as a hidden file source metadata column.

 

Reproducible code:
{code:java}
// prepare a file source df
df.select("*", "_metadata")
  .write.format("parquet").save(path)
spark.read.format("parquet").load(path)
  .select("*").show(){code}

  was:
Selecting and then writing df containing hidden file metadata column 
`_metadata` into a file format like `parquet`, `delta` will still keep the 
internal `Attribute` metadata information. Then when reading those `parquet`, 
`delta` files again, it will actually break the code, because it wrongly thinks 
user data schema named `_metadata` is a hidden file source metadata column.

 

Reproducible code:

```

// prepare a file source df

df.select("*", "_metadata")
  .write.format("parquet").save(path)

spark.read.format("parquet").load(path)
  .select("*").show()

```


> Fail to read parquet files after writing the hidden file metadata in
> 
>
> Key: SPARK-38314
> URL: https://issues.apache.org/jira/browse/SPARK-38314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Yaohua Zhao
>Priority: Major
>
> Selecting and then writing a df containing the hidden file metadata column 
> `_metadata` into a file format like `parquet` or `delta` will still keep the 
> internal `Attribute` metadata information. Then, when reading those `parquet` 
> or `delta` files again, it breaks, because Spark wrongly treats a user data 
> column named `_metadata` as a hidden file source metadata column.
>  
> Reproducible code:
> {code:java}
> // prepare a file source df
> df.select("*", "_metadata")
>   .write.format("parquet").save(path)
> spark.read.format("parquet").load(path)
>   .select("*").show(){code}






[jira] [Created] (SPARK-38314) Fail to read parquet files after writing the hidden file metadata in

2022-02-24 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-38314:
---

 Summary: Fail to read parquet files after writing the hidden file 
metadata in
 Key: SPARK-38314
 URL: https://issues.apache.org/jira/browse/SPARK-38314
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1
Reporter: Yaohua Zhao


Selecting and then writing a df containing the hidden file metadata column 
`_metadata` into a file format like `parquet` or `delta` will still keep the 
internal `Attribute` metadata information. Then, when reading those `parquet` 
or `delta` files again, it breaks, because Spark wrongly treats a user data 
column named `_metadata` as a hidden file source metadata column.

 

Reproducible code:
{code:java}
// prepare a file source df
df.select("*", "_metadata")
  .write.format("parquet").save(path)
spark.read.format("parquet").load(path)
  .select("*").show(){code}






[jira] [Resolved] (SPARK-37767) Follow-up Improvements of Hidden File Metadata Support for Spark SQL

2022-02-10 Thread Yaohua Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaohua Zhao resolved SPARK-37767.
-
Resolution: Fixed

> Follow-up Improvements of Hidden File Metadata Support for Spark SQL
> 
>
> Key: SPARK-37767
> URL: https://issues.apache.org/jira/browse/SPARK-37767
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yaohua Zhao
>Priority: Major
>
> Follow-up of https://issues.apache.org/jira/browse/SPARK-37273






[jira] [Created] (SPARK-38159) Minor refactor of MetadataAttribute unapply method

2022-02-09 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-38159:
---

 Summary: Minor refactor of MetadataAttribute unapply method
 Key: SPARK-38159
 URL: https://issues.apache.org/jira/browse/SPARK-38159
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.1
Reporter: Yaohua Zhao









[jira] [Resolved] (SPARK-37770) Performance improvements for ColumnVector `putByteArray`

2022-02-09 Thread Yaohua Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaohua Zhao resolved SPARK-37770.
-
Resolution: Fixed

> Performance improvements for ColumnVector `putByteArray`
> 
>
> Key: SPARK-37770
> URL: https://issues.apache.org/jira/browse/SPARK-37770
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yaohua Zhao
>Priority: Major
>







[jira] [Created] (SPARK-37896) ConstantColumnVector: a column vector with same values

2022-01-13 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-37896:
---

 Summary: ConstantColumnVector: a column vector with same values
 Key: SPARK-37896
 URL: https://issues.apache.org/jira/browse/SPARK-37896
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yaohua Zhao


Introduce a new column vector named `ConstantColumnVector`, which represents a 
column vector where every row has the same constant value.

It could help improve performance for the hidden file metadata columns in 
columnar file formats: since the metadata fields are identical for every row 
in a file, we don't need to copy and keep multiple copies of the data.
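A simplified sketch of the idea (illustrative only, not the actual Spark 
class):
{code:java}
// One stored value answers reads for every row index, so per-row copies of
// identical metadata (e.g. file_name) are unnecessary.
final class ConstantLongVector(numRows: Int, value: Long) {
  def getLong(rowId: Int): Long = {
    require(rowId >= 0 && rowId < numRows)
    value // the same constant regardless of rowId
  }
}
{code}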






[jira] [Created] (SPARK-37770) Performance improvements for ColumnVector `putByteArray`

2021-12-28 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-37770:
---

 Summary: Performance improvements for ColumnVector `putByteArray`
 Key: SPARK-37770
 URL: https://issues.apache.org/jira/browse/SPARK-37770
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yaohua Zhao









[jira] [Created] (SPARK-37769) Filter on the metadata struct

2021-12-28 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-37769:
---

 Summary: Filter on the metadata struct
 Key: SPARK-37769
 URL: https://issues.apache.org/jira/browse/SPARK-37769
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yaohua Zhao


Be able to skip reading some files entirely, based on filters on the metadata 
struct.
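For example (hypothetical path), a filter on the metadata struct could prune 
files before they are read:
{code:java}
import org.apache.spark.sql.functions.col

spark.read.parquet("/tmp/data")
  .filter(col("_metadata.file_name") === "part-00000.parquet")
  .show()
{code}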






[jira] [Created] (SPARK-37768) Schema pruning for the metadata struct

2021-12-28 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-37768:
---

 Summary: Schema pruning for the metadata struct
 Key: SPARK-37768
 URL: https://issues.apache.org/jira/browse/SPARK-37768
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yaohua Zhao









[jira] [Created] (SPARK-37767) Follow-up Improvements of Hidden File Metadata Support for Spark SQL

2021-12-28 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-37767:
---

 Summary: Follow-up Improvements of Hidden File Metadata Support 
for Spark SQL
 Key: SPARK-37767
 URL: https://issues.apache.org/jira/browse/SPARK-37767
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yaohua Zhao


Follow-up of https://issues.apache.org/jira/browse/SPARK-37273






[jira] [Created] (SPARK-37273) Hidden File Metadata Support for Spark SQL

2021-11-10 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-37273:
---

 Summary: Hidden File Metadata Support for Spark SQL
 Key: SPARK-37273
 URL: https://issues.apache.org/jira/browse/SPARK-37273
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yaohua Zhao


Provide a new interface in Spark SQL that allows users to query the metadata 
of the input files for all file formats, exposing it as *built-in hidden 
columns*, meaning *users can only see them when they explicitly reference 
them* (e.g. file path, file name).
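For example (hypothetical path), the column appears only when referenced 
explicitly:
{code:java}
val df = spark.read.format("parquet").load("/tmp/data")

df.printSchema()            // `_metadata` is not listed
df.select("*", "_metadata") // visible only when explicitly referenced
  .show()
{code}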


