[jira] [Created] (SPARK-46405) Issue with CSV schema inference and malformed records
Yaohua Zhao created SPARK-46405: --- Summary: Issue with CSV schema inference and malformed records Key: SPARK-46405 URL: https://issues.apache.org/jira/browse/SPARK-46405 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Yaohua Zhao There appears to be a discrepancy in the behavior of schema inference in the CSV reader compared to JSON. When processing CSV files without a predefined schema, the mechanism to handle malformed records seems to be inconsistent. Unlike the JSON format, where a `_corrupt_record` column is automatically added in the presence of malformed records, the CSV format does not exhibit this behavior. This inconsistency can lead to unexpected results and data loss during processing. *Steps to Reproduce:* # Create a CSV file with malformed records without providing a schema. # Observe that the `_corrupt_record` column is not automatically added to the final dataframe. *Expected Result:* The `_corrupt_record` column should be automatically added to the final dataframe when processing a CSV file with malformed records, similar to the behavior observed with JSON files. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
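For illustration, a minimal sketch of the discrepancy described above (the input paths and their malformed contents are hypothetical placeholders):
{code:java}
// JSON: when schema inference hits malformed records, the inferred schema
// picks up a _corrupt_record column (default name from
// spark.sql.columnNameOfCorruptRecord).
spark.read.json("/tmp/json_with_bad_rows").printSchema()

// CSV: with similarly malformed input and no user-provided schema, no
// _corrupt_record column appears in the inferred schema -- the behavior
// this issue asks to align with JSON.
spark.read.option("header", "true").csv("/tmp/csv_with_bad_rows").printSchema()
{code}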
[jira] [Updated] (SPARK-45815) Provide an interface for Streaming sources to add _metadata columns
[ https://issues.apache.org/jira/browse/SPARK-45815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaohua Zhao updated SPARK-45815: Component/s: Structured Streaming > Provide an interface for Streaming sources to add _metadata columns > --- > > Key: SPARK-45815 > URL: https://issues.apache.org/jira/browse/SPARK-45815 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.5.1 >Reporter: Yaohua Zhao >Priority: Major > > Currently, only the native V1 file-based streaming source can read the > `_metadata` column: > [https://github.com/apache/spark/blob/370870b7a0303e4a2c4b3dea1b479b4fcbc93f8d/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala#L63] > > Our goal is to create an interface that allows other streaming sources to add > `_metadata` columns. For instance, we would like the Delta Streaming > source, which you can find here: > [https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/sources/DeltaDataSource.scala#L49], > to extend this interface and provide the `_metadata` column for its > underlying storage format, such as Parquet. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45815) Provide an interface for Streaming sources to add _metadata columns
[ https://issues.apache.org/jira/browse/SPARK-45815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaohua Zhao updated SPARK-45815: Description: Currently, only the native V1 file-based streaming source can read the `_metadata` column: [https://github.com/apache/spark/blob/370870b7a0303e4a2c4b3dea1b479b4fcbc93f8d/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala#L63] Our goal is to create an interface that allows other streaming sources to add `_metadata` columns. For instance, we would like the Delta Streaming source, which you can find here: [https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/sources/DeltaDataSource.scala#L49], to extend this interface and provide the `_metadata` column for its underlying storage format, such as Parquet. was: Currently, only the native V1 file-based streaming source can read the `_metadata` column: https://github.com/apache/spark/blob/370870b7a0303e4a2c4b3dea1b479b4fcbc93f8d/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala#L63 Our goal is to create an interface that allows other streaming sources to add `_metadata` columns. For instance, we would like the Delta Streaming source, which you can find here: [https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/sources/DeltaDataSource.scala#L49], to extend this interface and provide the `_metadata` column for its underlying storage format, such as Parquet. > Provide an interface for Streaming sources to add _metadata columns > --- > > Key: SPARK-45815 > URL: https://issues.apache.org/jira/browse/SPARK-45815 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1 >Reporter: Yaohua Zhao >Priority: Major > > Currently, only the native V1 file-based streaming source can read the > `_metadata` column: > [https://github.com/apache/spark/blob/370870b7a0303e4a2c4b3dea1b479b4fcbc93f8d/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala#L63] > > Our goal is to create an interface that allows other streaming sources to add > `_metadata` columns. For instance, we would like the Delta Streaming > source, which you can find here: > [https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/sources/DeltaDataSource.scala#L49], > to extend this interface and provide the `_metadata` column for its > underlying storage format, such as Parquet. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45815) Provide an interface for Streaming sources to add _metadata columns
Yaohua Zhao created SPARK-45815: --- Summary: Provide an interface for Streaming sources to add _metadata columns Key: SPARK-45815 URL: https://issues.apache.org/jira/browse/SPARK-45815 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.1 Reporter: Yaohua Zhao Currently, only the native V1 file-based streaming source can read the `_metadata` column: https://github.com/apache/spark/blob/370870b7a0303e4a2c4b3dea1b479b4fcbc93f8d/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala#L63 Our goal is to create an interface that allows other streaming sources to add `_metadata` columns. For instance, we would like the Delta Streaming source, which you can find here: [https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/sources/DeltaDataSource.scala#L49], to extend this interface and provide the `_metadata` column for its underlying storage format, such as Parquet. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
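For context, a rough sketch of how the `_metadata` column is consumed from the built-in V1 file streaming source today (the schema string and path are placeholders); the proposal is to let other streaming sources, such as Delta, expose the same column:
{code:java}
val stream = spark.readStream
  .schema("id INT, name STRING")   // file streaming sources require a user-provided schema
  .csv("/tmp/stream_input")        // placeholder path
  .select("*", "_metadata")        // currently only supported by the native file-based source
{code}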
[jira] [Created] (SPARK-45035) Support ignoreCorruptFiles for multiline CSV
Yaohua Zhao created SPARK-45035: --- Summary: Support ignoreCorruptFiles for multiline CSV Key: SPARK-45035 URL: https://issues.apache.org/jira/browse/SPARK-45035 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Yaohua Zhao Today, `ignoreCorruptFiles` does not work well for the multiline CSV mode.
{code:java}
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

val testCorruptDF0 = spark.read.option("ignoreCorruptFiles", "true").option("multiline", "true").csv("/tmp/sourcepath/").show()
{code}
It throws an exception instead of silently ignoring the corrupt files:
{code:java}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4940.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4940.0 (TID 4031) (10.68.177.106 executor 0): com.univocity.parsers.common.TextParsingException: java.lang.IllegalStateException - Error reading from input
Parser Configuration: CsvParserSettings: Auto configuration enabled=true Auto-closing enabled=true Autodetect column delimiter=false Autodetect quotes=false Column reordering enabled=true Delimiters for detection=null Empty value= Escape unquoted values=false Header extraction enabled=null Headers=null Ignore leading whitespaces=false Ignore leading whitespaces in quotes=false Ignore trailing whitespaces=false Ignore trailing whitespaces in quotes=false Input buffer size=1048576 Input reading on separate thread=false Keep escape sequences=false Keep quotes=false Length of content displayed on error=1000 Line separator detection enabled=true Maximum number of characters per column=-1 Maximum number of columns=20480 Normalize escaped line separators=true Null value= Number of records to read=all Processor=none Restricting data in exceptions=false RowProcessor error handler=null Selected fields=none Skip bits as whitespace=true Skip empty lines=true Unescaped quote handling=STOP_AT_DELIMITER
Format configuration: CsvFormat: Comment character=# Field delimiter=, Line separator (normalized)=\n Line separator sequence=\n Quote character=" Quote escape character=\ Quote escape escape character=null
Internal state when error was thrown: line=0, column=0, record=0
	at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:402)
	at com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:277)
	at com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:843)
	at org.apache.spark.sql.catalyst.csv.UnivocityParser$$anon$1.<init>(UnivocityParser.scala:463)
	at org.apache.spark.sql.catalyst.csv.UnivocityParser$.convertStream(UnivocityParser.scala:46...
{code}
It is because the multiline parsing uses a different RDD (`BinaryFileRDD`) which does not go through `FileScanRDD`. We could potentially add this support to `BinaryFileRDD`, or even reuse `FileScanRDD` for the multiline parsing mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43177) Add deprecation warning for input_file_name()
Yaohua Zhao created SPARK-43177: --- Summary: Add deprecation warning for input_file_name() Key: SPARK-43177 URL: https://issues.apache.org/jira/browse/SPARK-43177 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Yaohua Zhao With the new `_metadata` column, users shouldn’t need to use input_file_name() anymore. We should add a deprecation warning and update the docs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
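A possible migration sketch for the docs (the read path is a placeholder):
{code:java}
import org.apache.spark.sql.functions.{col, input_file_name}

val df = spark.read.parquet("/tmp/input")   // placeholder path

// Before (to be deprecated):
df.select(input_file_name()).show()

// After: the hidden _metadata column carries the file path, plus more fields
df.select(col("_metadata.file_path")).show()
{code}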
[jira] [Created] (SPARK-41151) Keep built-in file _metadata column nullable value consistent
Yaohua Zhao created SPARK-41151: --- Summary: Keep built-in file _metadata column nullable value consistent Key: SPARK-41151 URL: https://issues.apache.org/jira/browse/SPARK-41151 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.1, 3.3.0, 3.3.2 Reporter: Yaohua Zhao In FileSourceStrategy, we add an Alias node to wrap the file metadata fields (e.g. file_name, file_size) in a NamedStruct ([here|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L279]). But `CreateNamedStruct` overrides the `nullable` value to `false` ([here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L443]), which differs from the `_metadata` struct's `nullable` value of `true` ([here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala#L467]). We should keep the nullable value consistent; otherwise, downstream optimization rules might rely on the nullability and cause unexpected behavior. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
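One rough way to observe the mismatch described above (a sketch only; the read path is a placeholder and the exact plan shapes depend on the Spark version):
{code:java}
val df = spark.read.parquet("/tmp/input").select("_metadata")   // placeholder path

// Declared schema: the _metadata struct is nullable = true
println(df.queryExecution.analyzed.schema("_metadata").nullable)

// Planned query: the struct is rebuilt via CreateNamedStruct, which may
// report nullable = false -- the inconsistency this issue targets
println(df.queryExecution.executedPlan.schema("_metadata").nullable)
{code}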
[jira] [Updated] (SPARK-41143) Add named arguments function syntax support and trait
[ https://issues.apache.org/jira/browse/SPARK-41143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaohua Zhao updated SPARK-41143: Description: Parser can parse: _FUNC{_}_{_} ( key0 => value0 ) (was: Parser can parse: _FUNC_ ( key0 => value0 )) > Add named arguments function syntax support and trait > - > > Key: SPARK-41143 > URL: https://issues.apache.org/jira/browse/SPARK-41143 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.2 >Reporter: Yaohua Zhao >Priority: Major > > Parser can parse: _FUNC{_}_{_} ( key0 => value0 ) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41143) Add named arguments function syntax support and trait
[ https://issues.apache.org/jira/browse/SPARK-41143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaohua Zhao updated SPARK-41143: Description: The parser can parse: {code:java} _FUNC_ ( key0 => value0 ){code} was:Parser can parse: _{_}FUNC_{_} ( key0 => value0 ) > Add named arguments function syntax support and trait > - > > Key: SPARK-41143 > URL: https://issues.apache.org/jira/browse/SPARK-41143 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.2 >Reporter: Yaohua Zhao >Priority: Major > > The parser can parse: > {code:java} > _FUNC_ ( key0 => value0 ){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41143) Add named arguments function syntax support and trait
[ https://issues.apache.org/jira/browse/SPARK-41143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaohua Zhao updated SPARK-41143: Description: Parser can parse: _{_}FUNC_{_} ( key0 => value0 ) (was: Parser can parse: _FUNC{_}_{_} ( key0 => value0 )) > Add named arguments function syntax support and trait > - > > Key: SPARK-41143 > URL: https://issues.apache.org/jira/browse/SPARK-41143 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.2 >Reporter: Yaohua Zhao >Priority: Major > > Parser can parse: _{_}FUNC_{_} ( key0 => value0 ) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41143) Add named arguments function syntax support and trait
Yaohua Zhao created SPARK-41143: --- Summary: Add named arguments function syntax support and trait Key: SPARK-41143 URL: https://issues.apache.org/jira/browse/SPARK-41143 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.2 Reporter: Yaohua Zhao Parser can parse: _FUNC_ ( key0 => value0 ) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41142) Support named arguments functions
Yaohua Zhao created SPARK-41142: --- Summary: Support named arguments functions Key: SPARK-41142 URL: https://issues.apache.org/jira/browse/SPARK-41142 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.3.2 Reporter: Yaohua Zhao Support named arguments functions in Spark SQL. General usage: _FUNC_(arg0, arg1, arg2, arg5 => value5, arg8 => value8) * Arguments can be passed positionally or by name * Positional arguments cannot come after a named argument. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
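An illustrative sketch of the call syntax (the function name `some_func` is hypothetical, used only to show the shape of the proposal):
{code:java}
// Positional arguments come first, then named arguments; a positional
// argument may not follow a named one.
spark.sql("SELECT some_func('a', 'b', opt1 => 10, opt2 => true)")
{code}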
[jira] [Commented] (SPARK-40460) Streaming metrics is zero when select _metadata
[ https://issues.apache.org/jira/browse/SPARK-40460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17606678#comment-17606678 ] Yaohua Zhao commented on SPARK-40460: - [~kabhwan] You are right! Updated > Streaming metrics is zero when select _metadata > --- > > Key: SPARK-40460 > URL: https://issues.apache.org/jira/browse/SPARK-40460 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Yaohua Zhao >Assignee: Yaohua Zhao >Priority: Major > Fix For: 3.4.0 > > > Streaming metrics report all 0 (`processedRowsPerSecond`, etc) when selecting > `_metadata` column. Because the logical plan from the batch and the actual > planned logical are mismatched: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40460) Streaming metrics is zero when select _metadata
[ https://issues.apache.org/jira/browse/SPARK-40460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaohua Zhao updated SPARK-40460: Affects Version/s: 3.4.0 > Streaming metrics is zero when select _metadata > --- > > Key: SPARK-40460 > URL: https://issues.apache.org/jira/browse/SPARK-40460 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.3.0, 3.4.0, 3.3.1, 3.3.2 >Reporter: Yaohua Zhao >Assignee: Yaohua Zhao >Priority: Major > Fix For: 3.4.0 > > > Streaming metrics report all 0 (`processedRowsPerSecond`, etc) when selecting > `_metadata` column. Because the logical plan from the batch and the actual > planned logical are mismatched: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40460) Streaming metrics is zero when select _metadata
[ https://issues.apache.org/jira/browse/SPARK-40460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaohua Zhao updated SPARK-40460: Affects Version/s: 3.3.1 3.3.2 (was: 3.2.0) (was: 3.2.1) (was: 3.2.2) > Streaming metrics is zero when select _metadata > --- > > Key: SPARK-40460 > URL: https://issues.apache.org/jira/browse/SPARK-40460 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.3.0, 3.3.1, 3.3.2 >Reporter: Yaohua Zhao >Assignee: Yaohua Zhao >Priority: Major > Fix For: 3.4.0 > > > Streaming metrics report all 0 (`processedRowsPerSecond`, etc) when selecting > `_metadata` column. Because the logical plan from the batch and the actual > planned logical are mismatched: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40460) Streaming metrics is zero when select _metadata
[ https://issues.apache.org/jira/browse/SPARK-40460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaohua Zhao updated SPARK-40460: Description: Streaming metrics report all 0 (`processedRowsPerSecond`, etc) when selecting `_metadata` column. Because the logical plan from the batch and the actual planned logical are mismatched: [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348] (was: Streaming metrics report all 0 (`processedRowsPerSecond`, etc) when selecting `_metadata` column. Because the logical plan from the streaming relation and the actual executed logical plan are mismatched: [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348]) > Streaming metrics is zero when select _metadata > --- > > Key: SPARK-40460 > URL: https://issues.apache.org/jira/browse/SPARK-40460 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Yaohua Zhao >Priority: Major > > Streaming metrics report all 0 (`processedRowsPerSecond`, etc) when selecting > `_metadata` column. Because the logical plan from the batch and the actual > planned logical are mismatched: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40460) Streaming metrics is zero when select _metadata
Yaohua Zhao created SPARK-40460: --- Summary: Streaming metrics is zero when select _metadata Key: SPARK-40460 URL: https://issues.apache.org/jira/browse/SPARK-40460 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.2.2, 3.3.0, 3.2.1, 3.2.0 Reporter: Yaohua Zhao Streaming metrics report all 0 (`processedRowsPerSecond`, etc) when selecting `_metadata` column. Because the logical plan from the streaming relation and the actual executed logical plan are mismatched: [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
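A rough repro sketch of the query shape that hits this (the schema, path, and sink are placeholders):
{code:java}
val query = spark.readStream
  .schema("id INT")                 // placeholder schema
  .csv("/tmp/stream_input")         // placeholder path
  .select("*", "_metadata")         // selecting _metadata triggers the plan mismatch
  .writeStream
  .format("console")
  .start()

// query.lastProgress then reports processedRowsPerSecond (and related metrics) as 0
{code}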
[jira] [Updated] (SPARK-39768) Strip any CRLF character if lineSep is not set in CSV data source
[ https://issues.apache.org/jira/browse/SPARK-39768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaohua Zhao updated SPARK-39768: Summary: Strip any CRLF character if lineSep is not set in CSV data source (was: Strip any CLRF character if lineSep is not set in CSV data source) > Strip any CRLF character if lineSep is not set in CSV data source > - > > Key: SPARK-39768 > URL: https://issues.apache.org/jira/browse/SPARK-39768 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yaohua Zhao >Priority: Minor > > If `lineSep` is not set, the line separator is automatically detected. To be > safe, we should strip any CLRF character at the suffix in the column names. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39768) Strip any CRLF character if lineSep is not set in CSV data source
[ https://issues.apache.org/jira/browse/SPARK-39768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaohua Zhao updated SPARK-39768: Description: If `lineSep` is not set, the line separator is automatically detected. To be safe, we should strip any _CRLF_ character at the suffix in the column names. (was: If `lineSep` is not set, the line separator is automatically detected. To be safe, we should strip any CLRF character at the suffix in the column names.) > Strip any CRLF character if lineSep is not set in CSV data source > - > > Key: SPARK-39768 > URL: https://issues.apache.org/jira/browse/SPARK-39768 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yaohua Zhao >Priority: Minor > > If `lineSep` is not set, the line separator is automatically detected. To be > safe, we should strip any _CRLF_ character at the suffix in the column names. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39768) Strip any CLRF character if lineSep is not set in CSV data source
[ https://issues.apache.org/jira/browse/SPARK-39768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566417#comment-17566417 ] Yaohua Zhao commented on SPARK-39768: - cc [~hyukjin.kwon] > Strip any CLRF character if lineSep is not set in CSV data source > - > > Key: SPARK-39768 > URL: https://issues.apache.org/jira/browse/SPARK-39768 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yaohua Zhao >Priority: Minor > > If `lineSep` is not set, the line separator is automatically detected. To be > safe, we should strip any CLRF character at the suffix in the column names. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39768) Strip any CLRF character if lineSep is not set in CSV data source
Yaohua Zhao created SPARK-39768: --- Summary: Strip any CLRF character if lineSep is not set in CSV data source Key: SPARK-39768 URL: https://issues.apache.org/jira/browse/SPARK-39768 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Yaohua Zhao If `lineSep` is not set, the line separator is automatically detected. To be safe, we should strip any CLRF character at the suffix in the column names. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39689) Support 2-chars lineSep in CSV datasource
Yaohua Zhao created SPARK-39689: --- Summary: Support 2-chars lineSep in CSV datasource Key: SPARK-39689 URL: https://issues.apache.org/jira/browse/SPARK-39689 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Yaohua Zhao The Univocity parser allows the line separator to be set to 1 or 2 characters ([code|https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/common/Format.java#L103]); CSV options should not block this usage ([code|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala#L218]). Due to the limitation around `normalizedNewLine` (https://github.com/uniVocity/univocity-parsers/issues/170), setting a 2-character line separator could cause some unexpected behaviors. Thus, we should probably leave this proposed fix as an undocumented feature and warn users who rely on it. A more proper fix could be investigated in the future. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
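For illustration, the kind of usage this change would unblock (the path is a placeholder; note the `normalizedNewLine` caveat above):
{code:java}
// A two-character line separator such as CRLF, which the underlying
// Univocity parser supports but CSVOptions currently rejects.
val df = spark.read
  .option("lineSep", "\r\n")
  .option("header", "true")
  .csv("/tmp/crlf_input.csv")       // placeholder path
{code}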
[jira] [Created] (SPARK-39404) Unable to query _metadata in streaming if getBatch returns multiple logical nodes in the DataFrame
Yaohua Zhao created SPARK-39404: --- Summary: Unable to query _metadata in streaming if getBatch returns multiple logical nodes in the DataFrame Key: SPARK-39404 URL: https://issues.apache.org/jira/browse/SPARK-39404 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.2.1 Reporter: Yaohua Zhao Here: [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L585] We should probably `transform` instead of `match` -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39014) Respect ignoreMissingFiles from Data Source options in InMemoryFileIndex
Yaohua Zhao created SPARK-39014: --- Summary: Respect ignoreMissingFiles from Data Source options in InMemoryFileIndex Key: SPARK-39014 URL: https://issues.apache.org/jira/browse/SPARK-39014 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.1 Reporter: Yaohua Zhao -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38767) Support ignoreCorruptFiles and ignoreMissingFiles in Data Source options
Yaohua Zhao created SPARK-38767: --- Summary: Support ignoreCorruptFiles and ignoreMissingFiles in Data Source options Key: SPARK-38767 URL: https://issues.apache.org/jira/browse/SPARK-38767 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.1 Reporter: Yaohua Zhao -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
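A sketch of the intended per-read usage, mirroring the existing session-wide SQL configs (the path is a placeholder):
{code:java}
// Today: only the session-wide configs exist
// spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
// spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

// Proposed: the same behavior scoped to a single read via data source options
val df = spark.read
  .option("ignoreCorruptFiles", "true")
  .option("ignoreMissingFiles", "true")
  .parquet("/tmp/input")            // placeholder path
{code}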
[jira] [Updated] (SPARK-38323) Support the hidden file metadata in Streaming
[ https://issues.apache.org/jira/browse/SPARK-38323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaohua Zhao updated SPARK-38323: Description: Currently, querying the hidden file metadata struct `_metadata` will fail with `readStream`, `writeStream` APIs. {code:java} spark .readStream ... .select("_metadata") .writeStream ... .start(){code} Need to expose the metadata output to `StreamingRelation` as well. > Support the hidden file metadata in Streaming > - > > Key: SPARK-38323 > URL: https://issues.apache.org/jira/browse/SPARK-38323 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.2.1 >Reporter: Yaohua Zhao >Priority: Major > > Currently, querying the hidden file metadata struct `_metadata` will fail > with `readStream`, `writeStream` APIs. > {code:java} > spark > .readStream > ... > .select("_metadata") > .writeStream > ... > .start(){code} > Need to expose the metadata output to `StreamingRelation` as well. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38323) Support the hidden file metadata in Streaming
Yaohua Zhao created SPARK-38323: --- Summary: Support the hidden file metadata in Streaming Key: SPARK-38323 URL: https://issues.apache.org/jira/browse/SPARK-38323 Project: Spark Issue Type: Improvement Components: SQL, Structured Streaming Affects Versions: 3.2.1 Reporter: Yaohua Zhao -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38314) Fail to read parquet files after writing the hidden file metadata in
[ https://issues.apache.org/jira/browse/SPARK-38314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaohua Zhao updated SPARK-38314: Description: Selecting and then writing df containing hidden file metadata column `_metadata` into a file format like `parquet`, `delta` will still keep the internal `Attribute` metadata information. Then when reading those `parquet`, `delta` files again, it will actually break the code, because it wrongly thinks user data schema named `_metadata` is a hidden file source metadata column. Reproducible code: {code:java} // prepare a file source df df.select("*", "_metadata") .write.format("parquet").save(path) spark.read.format("parquet").load(path) .select("*").show(){code} was: Selecting and then writing df containing hidden file metadata column `_metadata` into a file format like `parquet`, `delta` will still keep the internal `Attribute` metadata information. Then when reading those `parquet`, `delta` files again, it will actually break the code, because it wrongly thinks user data schema named `_metadata` is a hidden file source metadata column. Reproducible code: ``` // prepare a file source df df.select("*", "_metadata") .write.format("parquet").save(path) spark.read.format("parquet").load(path) .select("*").show() ``` > Fail to read parquet files after writing the hidden file metadata in > > > Key: SPARK-38314 > URL: https://issues.apache.org/jira/browse/SPARK-38314 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: Yaohua Zhao >Priority: Major > > Selecting and then writing df containing hidden file metadata column > `_metadata` into a file format like `parquet`, `delta` will still keep the > internal `Attribute` metadata information. Then when reading those `parquet`, > `delta` files again, it will actually break the code, because it wrongly > thinks user data schema named `_metadata` is a hidden file source metadata > column. > > Reproducible code: > {code:java} > // prepare a file source df > df.select("*", "_metadata") > .write.format("parquet").save(path) > spark.read.format("parquet").load(path) > .select("*").show(){code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38314) Fail to read parquet files after writing the hidden file metadata in
Yaohua Zhao created SPARK-38314: --- Summary: Fail to read parquet files after writing the hidden file metadata in Key: SPARK-38314 URL: https://issues.apache.org/jira/browse/SPARK-38314 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1 Reporter: Yaohua Zhao Selecting and then writing df containing hidden file metadata column `_metadata` into a file format like `parquet`, `delta` will still keep the internal `Attribute` metadata information. Then when reading those `parquet`, `delta` files again, it will actually break the code, because it wrongly thinks user data schema named `_metadata` is a hidden file source metadata column. Reproducible code: ``` // prepare a file source df df.select("*", "_metadata") .write.format("parquet").save(path) spark.read.format("parquet").load(path) .select("*").show() ``` -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37767) Follow-up Improvements of Hidden File Metadata Support for Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-37767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaohua Zhao resolved SPARK-37767. - Resolution: Fixed > Follow-up Improvements of Hidden File Metadata Support for Spark SQL > > > Key: SPARK-37767 > URL: https://issues.apache.org/jira/browse/SPARK-37767 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yaohua Zhao >Priority: Major > > Follow-up of https://issues.apache.org/jira/browse/SPARK-37273 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38159) Minor refactor of MetadataAttribute unapply method
Yaohua Zhao created SPARK-38159: --- Summary: Minor refactor of MetadataAttribute unapply method Key: SPARK-38159 URL: https://issues.apache.org/jira/browse/SPARK-38159 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.1 Reporter: Yaohua Zhao -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37770) Performance improvements for ColumnVector `putByteArray`
[ https://issues.apache.org/jira/browse/SPARK-37770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaohua Zhao resolved SPARK-37770. - Resolution: Fixed > Performance improvements for ColumnVector `putByteArray` > > > Key: SPARK-37770 > URL: https://issues.apache.org/jira/browse/SPARK-37770 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yaohua Zhao >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37896) ConstantColumnVector: a column vector with same values
Yaohua Zhao created SPARK-37896: --- Summary: ConstantColumnVector: a column vector with same values Key: SPARK-37896 URL: https://issues.apache.org/jira/browse/SPARK-37896 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Yaohua Zhao Introduce a new column vector named `ConstantColumnVector`, which represents a column vector where every row has the same constant value. It could help improve performance for the hidden file metadata columns in columnar file formats: since the metadata fields are exactly the same for every row in a file, we don't need to copy and keep multiple copies of the data. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37770) Performance improvements for ColumnVector `putByteArray`
Yaohua Zhao created SPARK-37770: --- Summary: Performance improvements for ColumnVector `putByteArray` Key: SPARK-37770 URL: https://issues.apache.org/jira/browse/SPARK-37770 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Yaohua Zhao -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37769) Filter on the metadata struct
Yaohua Zhao created SPARK-37769: --- Summary: Filter on the metadata struct Key: SPARK-37769 URL: https://issues.apache.org/jira/browse/SPARK-37769 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Yaohua Zhao Be able to skip reading some files based on filters on the metadata struct. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37768) Schema pruning for the metadata struct
Yaohua Zhao created SPARK-37768: --- Summary: Schema pruning for the metadata struct Key: SPARK-37768 URL: https://issues.apache.org/jira/browse/SPARK-37768 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Yaohua Zhao -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37767) Follow-up Improvements of Hidden File Metadata Support for Spark SQL
Yaohua Zhao created SPARK-37767: --- Summary: Follow-up Improvements of Hidden File Metadata Support for Spark SQL Key: SPARK-37767 URL: https://issues.apache.org/jira/browse/SPARK-37767 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Yaohua Zhao Follow-up of https://issues.apache.org/jira/browse/SPARK-37273 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37273) Hidden File Metadata Support for Spark SQL
Yaohua Zhao created SPARK-37273: --- Summary: Hidden File Metadata Support for Spark SQL Key: SPARK-37273 URL: https://issues.apache.org/jira/browse/SPARK-37273 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Yaohua Zhao Provide a new interface in Spark SQL that allows users to query the metadata of the input files for all file formats, expose them as *built-in hidden columns* meaning *users can only see them when they explicitly reference them* (e.g. file path, file name) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
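An illustrative sketch of the intended user experience (the path is a placeholder):
{code:java}
val df = spark.read.parquet("/tmp/input")   // placeholder path

// _metadata is hidden: it does not appear in printSchema() or SELECT *
df.printSchema()

// ...but it can be referenced explicitly, e.g. individual fields
df.select("*", "_metadata.file_path", "_metadata.file_name").show()
{code}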