[jira] [Updated] (SPARK-48691) Upgrade `scalatest` related dependencies to the 3.2.18 series
[ https://issues.apache.org/jira/browse/SPARK-48691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Guo updated SPARK-48691:
----------------------------
    Summary: Upgrade `scalatest` related dependencies to the 3.2.18 series  (was: Upgrade `mockito` to 5.12.0)

> Upgrade `scalatest` related dependencies to the 3.2.18 series
> -------------------------------------------------------------
>
> Key: SPARK-48691
> URL: https://issues.apache.org/jira/browse/SPARK-48691
> Project: Spark
> Issue Type: Sub-task
> Components: Build
> Affects Versions: 4.0.0
> Reporter: Wei Guo
> Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-48689) Reading lengthy JSON results in a corrupted record.
[ https://issues.apache.org/jira/browse/SPARK-48689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856910#comment-17856910 ]

Wei Guo edited comment on SPARK-48689 at 6/22/24 2:33 PM:
----------------------------------------------------------

This is controlled by the `maxStringLen` option, whose default value is 20,000,000. If you set the option to a large enough value when reading, you get the right result.
{code:java}
spark.read.option("maxStringLen", 1).json(path){code}
I ran a test with a long string of length 20,000,010 and confirmed this:

!image-2024-06-22-15-33-38-833.png!


was (Author: wayne guo):
This is controlled by option `maxStringLen`, the default value of it is 20,000,000. If you set the option when reading, you can get the right result.
{code:java}
spark.read.option("maxStringLen", 1).json(path){code}
I did a test with a long string of length 20,000,010 and proved that:

!image-2024-06-22-15-33-38-833.png!

> Reading lengthy JSON results in a corrupted record.
> ---------------------------------------------------
>
> Key: SPARK-48689
> URL: https://issues.apache.org/jira/browse/SPARK-48689
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.5.1
> Environment: Ubuntu 22.04, Python 3.11, and OpenJDK 22
> Reporter: Yuxiang Wei
> Priority: Major
> Labels: Reader
> Attachments: image-2024-06-22-15-33-38-833.png
>
>
> When reading a data frame from a JSON file that includes a very long string, Spark incorrectly marks it as a corrupted record even though the format is correct. Here is a minimal example with PySpark:
> {code:python}
> import json
> import tempfile
> from pyspark.sql import SparkSession
>
> # Create a Spark session
> spark = (SparkSession.builder
>     .appName("PySpark JSON Example")
>     .getOrCreate()
> )
>
> # Define the JSON content
> data = {
>     "text": "a" * 1
> }
>
> # Create a temporary file
> with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
>     # Write the JSON content to the temporary file
>     tmp_file.write(json.dumps(data) + "\n")
>     tmp_file_path = tmp_file.name
>
> # Load the JSON file into a PySpark DataFrame
> # (after the file is closed, so the write is fully flushed)
> df = spark.read.json(tmp_file_path)
>
> # Print the schema
> print(df)
> {code}
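The comment above attributes the corrupted record to the `maxStringLen` read option and its default of 20,000,000. The following is a minimal Python sketch of that check, not code taken from the issue. The helper names `write_long_json`, `exceeds_default_limit`, and `read_json_with_limit` are hypothetical, the constant only restates the default quoted in the comment, and treating the limit as inclusive is an assumption.

```python
import json
import tempfile

# Default quoted in the comment above; assumed to be an inclusive upper bound
# on the length of a single JSON string value.
DEFAULT_MAX_STRING_LEN = 20_000_000


def write_long_json(length):
    """Write a one-line JSON file whose "text" value has `length` characters."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
        tmp_file.write(json.dumps({"text": "a" * length}) + "\n")
    # The file is closed (and flushed) here, so a later reader sees all of it.
    return tmp_file.name


def exceeds_default_limit(length):
    """Return True if a string of `length` characters is over the default limit."""
    return length > DEFAULT_MAX_STRING_LEN


def read_json_with_limit(spark, path, max_string_len):
    """Read `path` with an explicit `maxStringLen`; `spark` is an existing SparkSession."""
    return spark.read.option("maxStringLen", max_string_len).json(path)
```

With a running session, `read_json_with_limit(spark, write_long_json(20_000_010), 20_000_100)` would mirror the test described in the comment. The value 20,000,100 is only an illustrative choice of "big enough".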
[jira] [Comment Edited] (SPARK-48689) Reading lengthy JSON results in a corrupted record.
[ https://issues.apache.org/jira/browse/SPARK-48689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856910#comment-17856910 ]

Wei Guo edited comment on SPARK-48689 at 6/22/24 7:36 AM:
----------------------------------------------------------

This is controlled by option `maxStringLen`, the default value of it is 20,000,000. If you set the option when reading, you can get the right result.
{code:java}
spark.read.option("maxStringLen", 1).json(path){code}
I did a test with a long string of length 20,000,010 and proved that:

!image-2024-06-22-15-33-38-833.png!


was (Author: wayne guo):
This is controlled by option `maxStringLen`, the default value of it is 20,000,000. If you set the option when reading, you can get the right result.
{code:java}
spark.read.option("maxStringLen", 1).json(path){code}
I made a test with a 20,000,010 length string:

!image-2024-06-22-15-33-38-833.png!

> Reading lengthy JSON results in a corrupted record.
> ---------------------------------------------------
>
> Key: SPARK-48689
> URL: https://issues.apache.org/jira/browse/SPARK-48689
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.5.1
> Environment: Ubuntu 22.04, Python 3.11, and OpenJDK 22
> Reporter: Yuxiang Wei
> Priority: Major
> Labels: Reader
> Attachments: image-2024-06-22-15-33-38-833.png
>
>
> When reading a data frame from a JSON file that includes a very long string, Spark incorrectly marks it as a corrupted record even though the format is correct. Here is a minimal example with PySpark:
> {code:python}
> import json
> import tempfile
> from pyspark.sql import SparkSession
>
> # Create a Spark session
> spark = (SparkSession.builder
>     .appName("PySpark JSON Example")
>     .getOrCreate()
> )
>
> # Define the JSON content
> data = {
>     "text": "a" * 1
> }
>
> # Create a temporary file
> with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
>     # Write the JSON content to the temporary file
>     tmp_file.write(json.dumps(data) + "\n")
>     tmp_file_path = tmp_file.name
>
> # Load the JSON file into a PySpark DataFrame
> # (after the file is closed, so the write is fully flushed)
> df = spark.read.json(tmp_file_path)
>
> # Print the schema
> print(df)
> {code}
[jira] [Updated] (SPARK-48689) Reading lengthy JSON results in a corrupted record.
[ https://issues.apache.org/jira/browse/SPARK-48689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Guo updated SPARK-48689:
----------------------------
    Attachment: image-2024-06-22-15-33-38-833.png

> Reading lengthy JSON results in a corrupted record.
> ---------------------------------------------------
>
> Key: SPARK-48689
> URL: https://issues.apache.org/jira/browse/SPARK-48689
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.5.1
> Environment: Ubuntu 22.04, Python 3.11, and OpenJDK 22
> Reporter: Yuxiang Wei
> Priority: Major
> Labels: Reader
> Attachments: image-2024-06-22-15-33-38-833.png
>
>
> When reading a data frame from a JSON file that includes a very long string, Spark incorrectly marks it as a corrupted record even though the format is correct. Here is a minimal example with PySpark:
> {code:python}
> import json
> import tempfile
> from pyspark.sql import SparkSession
>
> # Create a Spark session
> spark = (SparkSession.builder
>     .appName("PySpark JSON Example")
>     .getOrCreate()
> )
>
> # Define the JSON content
> data = {
>     "text": "a" * 1
> }
>
> # Create a temporary file
> with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
>     # Write the JSON content to the temporary file
>     tmp_file.write(json.dumps(data) + "\n")
>     tmp_file_path = tmp_file.name
>
> # Load the JSON file into a PySpark DataFrame
> # (after the file is closed, so the write is fully flushed)
> df = spark.read.json(tmp_file_path)
>
> # Print the schema
> print(df)
> {code}
[jira] [Commented] (SPARK-48689) Reading lengthy JSON results in a corrupted record.
[ https://issues.apache.org/jira/browse/SPARK-48689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856910#comment-17856910 ]

Wei Guo commented on SPARK-48689:
---------------------------------

This is controlled by option `maxStringLen`, the default value of it is 20,000,000. If you set the option when reading, you can get the right result.
{code:java}
spark.read.option("maxStringLen", 1).json(path){code}
I made a test with a 20,000,010 length string:

!image-2024-06-22-15-33-38-833.png!

> Reading lengthy JSON results in a corrupted record.
> ---------------------------------------------------
>
> Key: SPARK-48689
> URL: https://issues.apache.org/jira/browse/SPARK-48689
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.5.1
> Environment: Ubuntu 22.04, Python 3.11, and OpenJDK 22
> Reporter: Yuxiang Wei
> Priority: Major
> Labels: Reader
> Attachments: image-2024-06-22-15-33-38-833.png
>
>
> When reading a data frame from a JSON file that includes a very long string, Spark incorrectly marks it as a corrupted record even though the format is correct. Here is a minimal example with PySpark:
> {code:python}
> import json
> import tempfile
> from pyspark.sql import SparkSession
>
> # Create a Spark session
> spark = (SparkSession.builder
>     .appName("PySpark JSON Example")
>     .getOrCreate()
> )
>
> # Define the JSON content
> data = {
>     "text": "a" * 1
> }
>
> # Create a temporary file
> with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
>     # Write the JSON content to the temporary file
>     tmp_file.write(json.dumps(data) + "\n")
>     tmp_file_path = tmp_file.name
>
> # Load the JSON file into a PySpark DataFrame
> # (after the file is closed, so the write is fully flushed)
> df = spark.read.json(tmp_file_path)
>
> # Print the schema
> print(df)
> {code}
[jira] [Created] (SPARK-48671) Add test cases for Hex.hex
Wei Guo created SPARK-48671:
----------------------------

Summary: Add test cases for Hex.hex
Key: SPARK-48671
URL: https://issues.apache.org/jira/browse/SPARK-48671
Project: Spark
Issue Type: Test
Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo
[jira] [Commented] (SPARK-48660) The result of explain is incorrect for CreateTableAsSelect
[ https://issues.apache.org/jira/browse/SPARK-48660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856122#comment-17856122 ]

Wei Guo commented on SPARK-48660:
---------------------------------

I am working on this, and thank you for the recommendation [~yangjie01].

> The result of explain is incorrect for CreateTableAsSelect
> ----------------------------------------------------------
>
> Key: SPARK-48660
> URL: https://issues.apache.org/jira/browse/SPARK-48660
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.0, 4.0.0, 3.5.1
> Reporter: Yuming Wang
> Priority: Major
>
> How to reproduce:
> {code:sql}
> CREATE TABLE order_history_version_audit_rno (
>   eventid STRING,
>   id STRING,
>   referenceid STRING,
>   type STRING,
>   referencetype STRING,
>   sellerid BIGINT,
>   buyerid BIGINT,
>   producerid STRING,
>   versionid INT,
>   changedocuments ARRAY BIGINT, changeDetails: STRING>>,
>   dt STRING,
>   hr STRING)
> USING parquet
> PARTITIONED BY (dt, hr);
> explain cost
> CREATE TABLE order_history_version_audit_rno
> USING parquet
> PARTITIONED BY (dt)
> CLUSTERED BY (id) INTO 1000 buckets
> AS SELECT * FROM order_history_version_audit_rno
> WHERE dt >= '2023-11-29';
> {code}
> {noformat}
> spark-sql (default)>
>                    > explain cost
>                    > CREATE TABLE order_history_version_audit_rno
>                    > USING parquet
>                    > PARTITIONED BY (dt)
>                    > CLUSTERED BY (id) INTO 1000 buckets
>                    > AS SELECT * FROM order_history_version_audit_rno
>                    > WHERE dt >= '2023-11-29';
> == Optimized Logical Plan ==
> CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, [eventid, id, referenceid, type, referencetype, sellerid, buyerid, producerid, versionid, changedocuments, hr, dt]
>    +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, hr#16, dt#15]
>       +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, dt#15, hr#16]
>          +- Filter (dt#15 >= 2023-11-29)
>             +- SubqueryAlias spark_catalog.default.order_history_version_audit_rno
>                +- Relation spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16] parquet
>
> == Physical Plan ==
> Execute CreateDataSourceTableAsSelectCommand
>    +- CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, [eventid, id, referenceid, type, referencetype, sellerid, buyerid, producerid, versionid, changedocuments, hr, dt]
>          +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, hr#16, dt#15]
>             +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, dt#15, hr#16]
>                +- Filter (dt#15 >= 2023-11-29)
>                   +- SubqueryAlias spark_catalog.default.order_history_version_audit_rno
>                      +- Relation spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16] parquet
> {noformat}
[jira] [Comment Edited] (SPARK-48660) The result of explain is incorrect for CreateTableAsSelect
[ https://issues.apache.org/jira/browse/SPARK-48660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856122#comment-17856122 ]

Wei Guo edited comment on SPARK-48660 at 6/19/24 4:18 AM:
----------------------------------------------------------

I am working on this, and thank you for the recommendation [~LuciferYang]


was (Author: wayne guo):
I am working on this, and thank you for the recommendation [~yangjie01].

> The result of explain is incorrect for CreateTableAsSelect
> ----------------------------------------------------------
>
> Key: SPARK-48660
> URL: https://issues.apache.org/jira/browse/SPARK-48660
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.0, 4.0.0, 3.5.1
> Reporter: Yuming Wang
> Priority: Major
>
> How to reproduce:
> {code:sql}
> CREATE TABLE order_history_version_audit_rno (
>   eventid STRING,
>   id STRING,
>   referenceid STRING,
>   type STRING,
>   referencetype STRING,
>   sellerid BIGINT,
>   buyerid BIGINT,
>   producerid STRING,
>   versionid INT,
>   changedocuments ARRAY BIGINT, changeDetails: STRING>>,
>   dt STRING,
>   hr STRING)
> USING parquet
> PARTITIONED BY (dt, hr);
> explain cost
> CREATE TABLE order_history_version_audit_rno
> USING parquet
> PARTITIONED BY (dt)
> CLUSTERED BY (id) INTO 1000 buckets
> AS SELECT * FROM order_history_version_audit_rno
> WHERE dt >= '2023-11-29';
> {code}
> {noformat}
> spark-sql (default)>
>                    > explain cost
>                    > CREATE TABLE order_history_version_audit_rno
>                    > USING parquet
>                    > PARTITIONED BY (dt)
>                    > CLUSTERED BY (id) INTO 1000 buckets
>                    > AS SELECT * FROM order_history_version_audit_rno
>                    > WHERE dt >= '2023-11-29';
> == Optimized Logical Plan ==
> CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, [eventid, id, referenceid, type, referencetype, sellerid, buyerid, producerid, versionid, changedocuments, hr, dt]
>    +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, hr#16, dt#15]
>       +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, dt#15, hr#16]
>          +- Filter (dt#15 >= 2023-11-29)
>             +- SubqueryAlias spark_catalog.default.order_history_version_audit_rno
>                +- Relation spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16] parquet
>
> == Physical Plan ==
> Execute CreateDataSourceTableAsSelectCommand
>    +- CreateDataSourceTableAsSelectCommand `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, [eventid, id, referenceid, type, referencetype, sellerid, buyerid, producerid, versionid, changedocuments, hr, dt]
>          +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, hr#16, dt#15]
>             +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, dt#15, hr#16]
>                +- Filter (dt#15 >= 2023-11-29)
>                   +- SubqueryAlias spark_catalog.default.order_history_version_audit_rno
>                      +- Relation spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16] parquet
> {noformat}
[jira] [Created] (SPARK-48661) Upgrade RoaringBitmap to 1.1.0
Wei Guo created SPARK-48661:
----------------------------

Summary: Upgrade RoaringBitmap to 1.1.0
Key: SPARK-48661
URL: https://issues.apache.org/jira/browse/SPARK-48661
Project: Spark
Issue Type: Sub-task
Components: Build
Affects Versions: 4.0.0
Reporter: Wei Guo
[jira] [Updated] (SPARK-48635) Assign classes to join type errors and as-of join error
[ https://issues.apache.org/jira/browse/SPARK-48635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Guo updated SPARK-48635:
----------------------------
    Summary: Assign classes to join type errors and as-of join error  (was: Assign classes to join type errors and as-of join error _LEGACY_ERROR_TEMP_3217 )

> Assign classes to join type errors and as-of join error
> -------------------------------------------------------
>
> Key: SPARK-48635
> URL: https://issues.apache.org/jira/browse/SPARK-48635
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Wei Guo
> Priority: Minor
>
> join type errors:
> LEGACY_ERROR_TEMP[1319, 3216]
> as-of join error:
> _LEGACY_ERROR_TEMP_3217
[jira] [Updated] (SPARK-48635) Assign classes to join type errors and as-of join error _LEGACY_ERROR_TEMP_3217
[ https://issues.apache.org/jira/browse/SPARK-48635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Guo updated SPARK-48635:
----------------------------
    Summary: Assign classes to join type errors and as-of join error _LEGACY_ERROR_TEMP_3217  (was: Assign classes to join type errors _LEGACY_ERROR_TEMP_[1319, 3216] and as-of join error _LEGACY_ERROR_TEMP_3217 )

> Assign classes to join type errors and as-of join error _LEGACY_ERROR_TEMP_3217
> -------------------------------------------------------------------------------
>
> Key: SPARK-48635
> URL: https://issues.apache.org/jira/browse/SPARK-48635
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Wei Guo
> Priority: Minor
>
[jira] [Updated] (SPARK-48635) Assign classes to join type errors and as-of join error _LEGACY_ERROR_TEMP_3217
[ https://issues.apache.org/jira/browse/SPARK-48635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Guo updated SPARK-48635:
----------------------------
    Description:
join type errors:
LEGACY_ERROR_TEMP[1319, 3216]
as-of join error:
_LEGACY_ERROR_TEMP_3217

  was:
join type errors:
_LEGACY_ERROR_TEMP_[1319, 3216]
as-of join error:
_LEGACY_ERROR_TEMP_3217

> Assign classes to join type errors and as-of join error _LEGACY_ERROR_TEMP_3217
> -------------------------------------------------------------------------------
>
> Key: SPARK-48635
> URL: https://issues.apache.org/jira/browse/SPARK-48635
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Wei Guo
> Priority: Minor
>
> join type errors:
> LEGACY_ERROR_TEMP[1319, 3216]
> as-of join error:
> _LEGACY_ERROR_TEMP_3217
[jira] [Updated] (SPARK-48635) Assign classes to join type errors and as-of join error _LEGACY_ERROR_TEMP_3217
[ https://issues.apache.org/jira/browse/SPARK-48635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Guo updated SPARK-48635:
----------------------------
    Description:
join type errors:
_LEGACY_ERROR_TEMP_[1319, 3216]
as-of join error:

> Assign classes to join type errors and as-of join error _LEGACY_ERROR_TEMP_3217
> -------------------------------------------------------------------------------
>
> Key: SPARK-48635
> URL: https://issues.apache.org/jira/browse/SPARK-48635
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Wei Guo
> Priority: Minor
>
> join type errors:
> _LEGACY_ERROR_TEMP_[1319, 3216]
> as-of join error:
[jira] [Updated] (SPARK-48635) Assign classes to join type errors and as-of join error _LEGACY_ERROR_TEMP_3217
[ https://issues.apache.org/jira/browse/SPARK-48635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Guo updated SPARK-48635:
----------------------------
    Description:
join type errors:
_LEGACY_ERROR_TEMP_[1319, 3216]
as-of join error:
_LEGACY_ERROR_TEMP_3217

  was:
join type errors:
_LEGACY_ERROR_TEMP_[1319, 3216]
as-of join error:

> Assign classes to join type errors and as-of join error _LEGACY_ERROR_TEMP_3217
> -------------------------------------------------------------------------------
>
> Key: SPARK-48635
> URL: https://issues.apache.org/jira/browse/SPARK-48635
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Wei Guo
> Priority: Minor
>
> join type errors:
> _LEGACY_ERROR_TEMP_[1319, 3216]
> as-of join error:
> _LEGACY_ERROR_TEMP_3217
[jira] [Created] (SPARK-48635) Assign classes to join type errors _LEGACY_ERROR_TEMP_[1319, 3216] and as-of join error _LEGACY_ERROR_TEMP_3217
Wei Guo created SPARK-48635:
----------------------------

Summary: Assign classes to join type errors _LEGACY_ERROR_TEMP_[1319, 3216] and as-of join error _LEGACY_ERROR_TEMP_3217
Key: SPARK-48635
URL: https://issues.apache.org/jira/browse/SPARK-48635
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo
[jira] [Updated] (SPARK-48614) Cleanup deprecated api usage related to kafka-clients
[ https://issues.apache.org/jira/browse/SPARK-48614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Guo updated SPARK-48614:
----------------------------
    Description:     (was: There are some deprecated classes and methods in commons-io called in Spark, we need to replace them:
 * writeStringToFile(final File file, final String data)
 * CountingInputStream)

> Cleanup deprecated api usage related to kafka-clients
> -----------------------------------------------------
>
> Key: SPARK-48614
> URL: https://issues.apache.org/jira/browse/SPARK-48614
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Wei Guo
> Assignee: Wei Guo
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
>
[jira] [Created] (SPARK-48614) Cleanup deprecated api usage related to kafka-clients
Wei Guo created SPARK-48614:
----------------------------

Summary: Cleanup deprecated api usage related to kafka-clients
Key: SPARK-48614
URL: https://issues.apache.org/jira/browse/SPARK-48614
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo
Assignee: Wei Guo
Fix For: 4.0.0

There are some deprecated classes and methods in commons-io called in Spark, we need to replace them:
 * writeStringToFile(final File file, final String data)
 * CountingInputStream
[jira] [Created] (SPARK-48604) Replace deprecated classes and methods of arrow-vector called in Spark
Wei Guo created SPARK-48604:
----------------------------

Summary: Replace deprecated classes and methods of arrow-vector called in Spark
Key: SPARK-48604
URL: https://issues.apache.org/jira/browse/SPARK-48604
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo

There are some deprecated classes and methods in commons-io called in Spark, we need to replace them:
 * writeStringToFile(final File file, final String data)
 * CountingInputStream
[jira] [Updated] (SPARK-48604) Replace deprecated classes and methods of arrow-vector called in Spark
[ https://issues.apache.org/jira/browse/SPARK-48604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Guo updated SPARK-48604:
----------------------------
    Description:
There are some deprecated classes and methods in arrow-vector called in Spark, we need to replace them:
 * ArrowType.Decimal(precision, scale)

  was:
There are some deprecated classes and methods in commons-io called in Spark, we need to replace them:
 * writeStringToFile(final File file, final String data)
 * CountingInputStream

> Replace deprecated classes and methods of arrow-vector called in Spark
> ----------------------------------------------------------------------
>
> Key: SPARK-48604
> URL: https://issues.apache.org/jira/browse/SPARK-48604
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Wei Guo
> Priority: Major
> Labels: pull-request-available
>
> There are some deprecated classes and methods in arrow-vector called in Spark, we need to replace them:
>  * ArrowType.Decimal(precision, scale)
[jira] [Updated] (SPARK-48583) Replace deprecated classes and methods of commons-io called in Spark
[ https://issues.apache.org/jira/browse/SPARK-48583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Guo updated SPARK-48583:
----------------------------
    Summary: Replace deprecated classes and methods of commons-io called in Spark  (was: Replace deprecated classes and methods of `commons-io` called in Spark)

> Replace deprecated classes and methods of commons-io called in Spark
> --------------------------------------------------------------------
>
> Key: SPARK-48583
> URL: https://issues.apache.org/jira/browse/SPARK-48583
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Wei Guo
> Priority: Major
> Labels: pull-request-available
>
> There are some deprecated classes and methods in commons-io called in Spark, we need to replace them:
>  * writeStringToFile(final File file, final String data)
>  * CountingInputStream
[jira] [Updated] (SPARK-48583) Replace deprecated classes and methods of `commons-io` called in Spark
[ https://issues.apache.org/jira/browse/SPARK-48583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Guo updated SPARK-48583:
----------------------------
    Description:
There are some deprecated classes and methods in commons-io called in Spark, we need to replace them:
 * writeStringToFile(final File file, final String data)
 * CountingInputStream

  was:
There are some deprecated classes and methods in `commons-io` called in Spark, we need to replace them:
 * `writeStringToFile(final File file, final String data);
 * `CountingInputStream`

> Replace deprecated classes and methods of `commons-io` called in Spark
> ----------------------------------------------------------------------
>
> Key: SPARK-48583
> URL: https://issues.apache.org/jira/browse/SPARK-48583
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Wei Guo
> Priority: Major
> Labels: pull-request-available
>
> There are some deprecated classes and methods in commons-io called in Spark, we need to replace them:
>  * writeStringToFile(final File file, final String data)
>  * CountingInputStream
[jira] [Updated] (SPARK-48583) Replace deprecated classes and methods of `commons-io` called in Spark
[ https://issues.apache.org/jira/browse/SPARK-48583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Guo updated SPARK-48583:
----------------------------
    Description:
There are some deprecated classes and methods in `commons-io` called in Spark, we need to replace them:
 * `writeStringToFile(final File file, final String data);
 * `CountingInputStream`

  was:Method `writeStringToFile(final File file, final String data)` in class `FileUtils` is deprecated, use `writeStringToFile(final File file, final String data, final Charset charset)` instead in UDFXPathUtilSuite.

> Replace deprecated classes and methods of `commons-io` called in Spark
> ----------------------------------------------------------------------
>
> Key: SPARK-48583
> URL: https://issues.apache.org/jira/browse/SPARK-48583
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Wei Guo
> Priority: Major
> Labels: pull-request-available
>
> There are some deprecated classes and methods in `commons-io` called in Spark, we need to replace them:
>  * `writeStringToFile(final File file, final String data);
>  * `CountingInputStream`
[jira] [Updated] (SPARK-48583) Replace deprecated classes and methods of `commons-io` called in Spark
[ https://issues.apache.org/jira/browse/SPARK-48583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Guo updated SPARK-48583:
----------------------------
    Description:
There are some deprecated classes and methods in `commons-io` called in Spark, we need to replace them:
 * `writeStringToFile(final File file, final String data);
 * `CountingInputStream`

  was:
There are some deprecated classes and methods in `commons-io` called in Spark, we need to replace them:
 * `writeStringToFile(final File file, final String data);
 * `CountingInputStream`

> Replace deprecated classes and methods of `commons-io` called in Spark
> ----------------------------------------------------------------------
>
> Key: SPARK-48583
> URL: https://issues.apache.org/jira/browse/SPARK-48583
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Wei Guo
> Priority: Major
> Labels: pull-request-available
>
> There are some deprecated classes and methods in `commons-io` called in Spark, we need to replace them:
>  * `writeStringToFile(final File file, final String data);
>  * `CountingInputStream`
[jira] [Updated] (SPARK-48583) Replace deprecated classes and methods of `commons-io` called in Spark
[ https://issues.apache.org/jira/browse/SPARK-48583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Guo updated SPARK-48583:

Summary: Replace deprecated classes and methods of `commons-io` called in Spark (was: Replace deprecated `FileUtils#writeStringToFile`)

> Replace deprecated classes and methods of `commons-io` called in Spark
> --
>
> Key: SPARK-48583
> URL: https://issues.apache.org/jira/browse/SPARK-48583
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Wei Guo
> Priority: Major
> Labels: pull-request-available
>
> Method `writeStringToFile(final File file, final String data)` in class
> `FileUtils` is deprecated; use `writeStringToFile(final File file, final
> String data, final Charset charset)` instead in UDFXPathUtilSuite.
[jira] [Created] (SPARK-48583) Replace deprecated `FileUtils#writeStringToFile`
Wei Guo created SPARK-48583:
---

Summary: Replace deprecated `FileUtils#writeStringToFile`
Key: SPARK-48583
URL: https://issues.apache.org/jira/browse/SPARK-48583
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo

Method `writeStringToFile(final File file, final String data)` in class `FileUtils` is deprecated; use `writeStringToFile(final File file, final String data, final Charset charset)` instead in UDFXPathUtilSuite.
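A minimal sketch of the replacement this issue describes, assuming commons-io (2.4 or later) is on the classpath. The object name `WriteStringSketch` and the helper `writeUtf8` are made up for illustration and are not taken from UDFXPathUtilSuite; only the three-argument `FileUtils.writeStringToFile` overload comes from the issue text.

```scala
import java.io.File
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.commons.io.FileUtils

// Hypothetical helper, not part of Spark: calls the non-deprecated overload.
object WriteStringSketch {
  // Passing the charset explicitly avoids relying on the platform default encoding.
  def writeUtf8(file: File, data: String): Unit =
    FileUtils.writeStringToFile(file, data, StandardCharsets.UTF_8)

  def main(args: Array[String]): Unit = {
    val file = File.createTempFile("write-string-sketch", ".txt")
    try {
      writeUtf8(file, "<a><b>b1</b></a>")
      // Read the bytes back with the same charset to check the round trip.
      val readBack = new String(Files.readAllBytes(file.toPath), StandardCharsets.UTF_8)
      println(readBack)
    } finally {
      file.delete()
    }
  }
}
```

The two-argument overload writes with the JVM default encoding, so the explicit charset mainly makes test files reproducible across machines.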
[jira] [Updated] (SPARK-48581) Upgrade dropwizard metrics to 4.2.26
[ https://issues.apache.org/jira/browse/SPARK-48581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Guo updated SPARK-48581:

Summary: Upgrade dropwizard metrics to 4.2.26 (was: Upgrade dropwizard metrics 4.2.26)

> Upgrade dropwizard metrics to 4.2.26
>
>
> Key: SPARK-48581
> URL: https://issues.apache.org/jira/browse/SPARK-48581
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 4.0.0
> Reporter: Wei Guo
> Priority: Minor
> Labels: pull-request-available
[jira] [Created] (SPARK-48581) Upgrade dropwizard metrics 4.2.26
Wei Guo created SPARK-48581:
---

Summary: Upgrade dropwizard metrics 4.2.26
Key: SPARK-48581
URL: https://issues.apache.org/jira/browse/SPARK-48581
Project: Spark
Issue Type: Improvement
Components: Build
Affects Versions: 4.0.0
Reporter: Wei Guo
[jira] [Created] (SPARK-48539) Upgrade docker-java to 3.3.6
Wei Guo created SPARK-48539:
---

Summary: Upgrade docker-java to 3.3.6
Key: SPARK-48539
URL: https://issues.apache.org/jira/browse/SPARK-48539
Project: Spark
Issue Type: Improvement
Components: Spark Docker
Affects Versions: 4.0.0
Reporter: Wei Guo
Fix For: 4.0.0
[jira] [Commented] (SPARK-47259) Assign classes to interval errors
[ https://issues.apache.org/jira/browse/SPARK-47259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850238#comment-17850238 ]

Wei Guo commented on SPARK-47259:
-
Update `_LEGACY_ERROR_TEMP_32[08-14]` to `_LEGACY_ERROR_TEMP_32[09-14]`, because `_LEGACY_ERROR_TEMP_3208` is not related to interval errors.

> Assign classes to interval errors
> -
>
> Key: SPARK-47259
> URL: https://issues.apache.org/jira/browse/SPARK-47259
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Max Gekk
> Priority: Minor
> Labels: starter
>
> Choose a proper name for the error classes *_LEGACY_ERROR_TEMP_32[09-14]*
> defined in {*}core/src/main/resources/error/error-classes.json{*}. The name
> should be short but complete (look at the example in error-classes.json).
> Add a test which triggers the error from user code if such a test doesn't
> exist yet. Check exception fields by using {*}checkError(){*}. That function
> checks only the valuable error fields and avoids depending on the error
> message text. In this way, tech editors can modify the error format in
> error-classes.json without worrying about Spark's internal tests. Migrate
> other tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), replace
> the error with an internal error, see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is
> not clear. Propose to users how to avoid and fix this kind of error.
> Please look at the PRs below as examples:
> * [https://github.com/apache/spark/pull/38685]
> * [https://github.com/apache/spark/pull/38656]
> * [https://github.com/apache/spark/pull/38490]
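The bracket shorthand in this comment reads as an inclusive numeric range rather than a regex character class. A small sketch that expands it, using a hypothetical helper `legacyErrorNames` that is not part of Spark:

```scala
// Hypothetical helper for illustration only; not part of Spark.
object LegacyErrorRange {
  // Expands an inclusive numeric range into legacy error class names.
  def legacyErrorNames(first: Int, last: Int): Seq[String] =
    (first to last).map(n => s"_LEGACY_ERROR_TEMP_$n")

  def main(args: Array[String]): Unit = {
    // 32[09-14] stands for 3209 through 3214; 3208 is no longer part of the range.
    legacyErrorNames(3209, 3214).foreach(println)
  }
}
```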
[jira] [Updated] (SPARK-47259) Assign classes to interval errors
[ https://issues.apache.org/jira/browse/SPARK-47259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Guo updated SPARK-47259:

Description:
Choose a proper name for the error classes *_LEGACY_ERROR_TEMP_32[09-14]* defined in {*}core/src/main/resources/error/error-classes.json{*}. The name should be short but complete (look at the example in error-classes.json).
Add a test which triggers the error from user code if such a test doesn't exist yet. Check exception fields by using {*}checkError(){*}. That function checks only the valuable error fields and avoids depending on the error message text. In this way, tech editors can modify the error format in error-classes.json without worrying about Spark's internal tests. Migrate other tests that might trigger the error onto checkError().
If you cannot reproduce the error from user space (using a SQL query), replace the error with an internal error, see {*}SparkException.internalError(){*}.
Improve the error message format in error-classes.json if the current one is not clear. Propose to users how to avoid and fix this kind of error.
Please look at the PRs below as examples:
* [https://github.com/apache/spark/pull/38685]
* [https://github.com/apache/spark/pull/38656]
* [https://github.com/apache/spark/pull/38490]

was:
Choose a proper name for the error classes *_LEGACY_ERROR_TEMP_32[08-14]* defined in {*}core/src/main/resources/error/error-classes.json{*}. The name should be short but complete (look at the example in error-classes.json).
Add a test which triggers the error from user code if such a test doesn't exist yet. Check exception fields by using {*}checkError(){*}. That function checks only the valuable error fields and avoids depending on the error message text. In this way, tech editors can modify the error format in error-classes.json without worrying about Spark's internal tests. Migrate other tests that might trigger the error onto checkError().
If you cannot reproduce the error from user space (using a SQL query), replace the error with an internal error, see {*}SparkException.internalError(){*}.
Improve the error message format in error-classes.json if the current one is not clear. Propose to users how to avoid and fix this kind of error.
Please look at the PRs below as examples:
* [https://github.com/apache/spark/pull/38685]
* [https://github.com/apache/spark/pull/38656]
* [https://github.com/apache/spark/pull/38490]

> Assign classes to interval errors
> -
>
> Key: SPARK-47259
> URL: https://issues.apache.org/jira/browse/SPARK-47259
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Max Gekk
> Priority: Minor
> Labels: starter
>
> Choose a proper name for the error classes *_LEGACY_ERROR_TEMP_32[09-14]*
> defined in {*}core/src/main/resources/error/error-classes.json{*}. The name
> should be short but complete (look at the example in error-classes.json).
> Add a test which triggers the error from user code if such a test doesn't
> exist yet. Check exception fields by using {*}checkError(){*}. That function
> checks only the valuable error fields and avoids depending on the error
> message text. In this way, tech editors can modify the error format in
> error-classes.json without worrying about Spark's internal tests. Migrate
> other tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), replace
> the error with an internal error, see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is
> not clear. Propose to users how to avoid and fix this kind of error.
> Please look at the PRs below as examples:
> * [https://github.com/apache/spark/pull/38685]
> * [https://github.com/apache/spark/pull/38656]
> * [https://github.com/apache/spark/pull/38490]
[jira] [Commented] (SPARK-40678) JSON conversion of ArrayType is not properly supported in Spark 3.2/2.13
[ https://issues.apache.org/jira/browse/SPARK-40678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687573#comment-17687573 ]

Wei Guo commented on SPARK-40678:
-
Fixed by PR 38154: https://github.com/apache/spark/pull/38154

> JSON conversion of ArrayType is not properly supported in Spark 3.2/2.13
>
>
> Key: SPARK-40678
> URL: https://issues.apache.org/jira/browse/SPARK-40678
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 3.2.0
> Reporter: Cédric Chantepie
> Priority: Major
>
> In Spark 3.2 (Scala 2.13), values with {{ArrayType}} are no longer properly
> supported in JSON conversion; e.g.
> {noformat}
> import org.apache.spark.sql.SparkSession
> case class KeyValue(key: String, value: Array[Byte])
> val spark =
>   SparkSession.builder().master("local[1]").appName("test").getOrCreate()
> import spark.implicits._
> val df = Seq(Array(KeyValue("foo", "bar".getBytes))).toDF()
> df.foreach(r => println(r.json))
> {noformat}
> Expected:
> {noformat}
> [{foo, bar}]
> {noformat}
> Encountered:
> {noformat}
> java.lang.IllegalArgumentException: Failed to convert value
> ArraySeq([foo,[B@dcdb68f]) (class of class
> scala.collection.mutable.ArraySeq$ofRef}) with the type of
> ArrayType(Seq(StructField(key,StringType,false),
> StructField(value,BinaryType,false)),true) to JSON.
> at org.apache.spark.sql.Row.toJson$1(Row.scala:604)
> at org.apache.spark.sql.Row.jsonValue(Row.scala:613)
> at org.apache.spark.sql.Row.jsonValue$(Row.scala:552)
> at
> org.apache.spark.sql.catalyst.expressions.GenericRow.jsonValue(rows.scala:166)
> at org.apache.spark.sql.Row.json(Row.scala:535)
> at org.apache.spark.sql.Row.json$(Row.scala:535)
> at
> org.apache.spark.sql.catalyst.expressions.GenericRow.json(rows.scala:166)
> {noformat}
[jira] [Commented] (SPARK-39348) Create table in overwrite mode fails when interrupted
[ https://issues.apache.org/jira/browse/SPARK-39348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686390#comment-17686390 ]

Wei Guo commented on SPARK-39348:
-
After PR [https://github.com/apache/spark/pull/26559], the legacy option described in the note below has been removed:
* Since Spark 2.4, creating a managed table with nonempty location is not allowed. An exception is thrown when attempting to create a managed table with nonempty location. Setting {{spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation}} to {{true}} restores the previous behavior. This option will be removed in Spark 3.0.

> Create table in overwrite mode fails when interrupted
> -
>
> Key: SPARK-39348
> URL: https://issues.apache.org/jira/browse/SPARK-39348
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 3.1.1
> Reporter: Max
> Priority: Major
>
> When you attempt to rerun an Apache Spark write operation by cancelling the
> currently running job, the following error occurs:
> {code:java}
> Error: org.apache.spark.sql.AnalysisException: Cannot create the managed
> table('`testdb`.` testtable`').
> The associated location
> ('dbfs:/user/hive/warehouse/testdb.db/metastore_cache_ testtable) already
> exists.;{code}
> This problem can occur if:
> * The cluster is terminated while a write operation is in progress.
> * A temporary network issue occurs.
> * The job is interrupted.
> You can reproduce the problem by following these steps:
> 1. Create a DataFrame:
> {code:java}
> val df = spark.range(1000){code}
> 2. Write the DataFrame to a location in overwrite mode:
> {code:java}
> df.write.mode(SaveMode.Overwrite).saveAsTable("testdb.testtable"){code}
> 3. Cancel the command while it is executing.
> 4. Re-run the {{write}} command.
[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicitly in CSV dataSource
[ https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-42335: Description: In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it also involved a new feature of univocity-parsers that quoting values of the first column that start with the comment character. It made a breaking for users downstream that handing a whole row as input. For codes: {code:java} Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} Before Spark 3.0,the content of output CSV files is shown as: !image-2023-02-03-18-56-01-596.png! After this change, the content is shown as: !image-2023-02-03-18-56-10-083.png! For users, they can't set comment option to '\u' to keep the behavior as before because the new added `isCommentSet` check logic as follows: {code:java} val isCommentSet = this.comment != '\u' def asWriterSettings: CsvWriterSettings = { // other code if (isCommentSet) { format.setComment(comment) } // other code } {code} It's better to pass the comment option through to univocity if users set it explicitly in CSV dataSource. 
After this change, the behavior as flows: |id|code|2.4 and before|3.0 and after|this update|remark| |1|Seq("#abc", "\udef", "xyz").toDF() .write.{color:#57d9a3}option("comment", "\u"){color}.csv(path)|#abc *def* xyz|{color:#4c9aff}"#abc"{color} {color:#4c9aff}*def*{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color} {color:#4c9aff}*"def"*{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit difference with 3.0{color}| |2|Seq("#abc", "\udef", "xyz").toDF() .write{color:#57d9a3}.option("comment", "#"){color}.csv(path)|#abc *def* xyz|"#abc" *def* xyz|"#abc" *def* xyz|the same| |3|Seq("#abc", "\udef", "xyz").toDF() .write.csv(path)|#abc *def* xyz|"#abc" *def* xyz|"#abc" *def* xyz|default behavior: the same| |4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path) spark.read.{color:#57d9a3}option("comment", "\u"){color}.csv(path)|#abc xyz|{color:#4c9aff}#abc{color} {color:#4c9aff}\udef{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit difference with 3.0{color}| |5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path) spark.read.{color:#57d9a3}option("comment", "#"){color}.csv(path)|\udef xyz|\udef xyz|\udef xyz|the same| |6|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path) spark.read.csv(path)|#abc xyz|#abc \udef xyz|#abc \udef xyz|default behavior: the same| was: In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it also involved a new feature of univocity-parsers that quoting values of the first column that start with the comment character. It made a breaking for users downstream that handing a whole row as input. For codes: {code:java} Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} Before Spark 3.0,the content of output CSV files is shown as: !image-2023-02-03-18-56-01-596.png! 
After this change, the content is shown as: !image-2023-02-03-18-56-10-083.png! For users, they can't set comment option to '\u' to keep the behavior as before because the new added `isCommentSet` check logic as follows: {code:java} val isCommentSet = this.comment != '\u' def asWriterSettings: CsvWriterSettings = { // other code if (isCommentSet) { format.setComment(comment) } // other code } {code} It's better to pass the comment option through to univocity if users set it explicitly in CSV dataSource. After this change, the behavior as flows: |id|code|2.4 and before|3.0 and after|this update|remark| |1|Seq("#abc", "\udef", "xyz").toDF() .write.option("comment", "\u").csv(path)|#abc *def* xyz|{color:#4c9aff}"#abc"{color} {color:#4c9aff}*def*{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color} {color:#4c9aff}*"def"*{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit difference with 3.0{color}| |2|Seq("#abc", "\udef", "xyz").toDF() .write.option("comment", "#").csv(path)|#abc *def* xyz|"#abc" *def* xyz|"#abc" *def* xyz|the same| |3|Seq("#abc", "\udef", "xyz").toDF() .write.csv(path)|#abc *def* xyz|"#abc" *def* xyz|"#abc" *def* xyz|the same| |4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path) spark.read.option("comment", "\u").csv(path)|#abc xyz|{color:#4c9aff}#abc{color} {color:#4c9aff}\udef{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit difference with 3.0{color}| |5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path) spark.read.option("comment",
[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicitly in CSV dataSource
[ https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-42335: Description: In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it also involved a new feature of univocity-parsers that quoting values of the first column that start with the comment character. It made a breaking for users downstream that handing a whole row as input. For codes: {code:java} Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} Before Spark 3.0,the content of output CSV files is shown as: !image-2023-02-03-18-56-01-596.png! After this change, the content is shown as: !image-2023-02-03-18-56-10-083.png! For users, they can't set comment option to '\u' to keep the behavior as before because the new added `isCommentSet` check logic as follows: {code:java} val isCommentSet = this.comment != '\u' def asWriterSettings: CsvWriterSettings = { // other code if (isCommentSet) { format.setComment(comment) } // other code } {code} It's better to pass the comment option through to univocity if users set it explicitly in CSV dataSource. 
|id|code|2.4 and before|3.0 and after|this update|remark| |1|Seq("#abc", "\udef", "xyz").toDF() .write.option("comment", "\u").csv(path)|#abc *def* xyz|{color:#4c9aff}"#abc"{color} {color:#4c9aff}*def*{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color} {color:#4c9aff}*"def"*{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit difference with 3.0{color}| |2|Seq("#abc", "\udef", "xyz").toDF() .write.option("comment", "#").csv(path)|#abc *def* xyz|"#abc" *def* xyz|"#abc" *def* xyz|the same| |3|Seq("#abc", "\udef", "xyz").toDF() .write.csv(path)|#abc *def* xyz|"#abc" *def* xyz|"#abc" *def* xyz|the same| |4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path) spark.read.option("comment", "\u").csv(path)|#abc xyz|{color:#4c9aff}#abc{color} {color:#4c9aff}\udef{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit difference with 3.0{color}| |5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path) spark.read.option("comment", "#").csv(path)|\udef xyz|\udef xyz|\udef xyz|the same| |6|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path) spark.read.csv(path)|#abc xyz|#abc \udef xyz|#abc \udef xyz|the same| was: In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it also involved a new feature of univocity-parsers that quoting values of the first column that start with the comment character. It made a breaking for users downstream that handing a whole row as input. For codes: {code:java} Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} Before Spark 3.0,the content of output CSV files is shown as: !image-2023-02-03-18-56-01-596.png! After this change, the content is shown as: !image-2023-02-03-18-56-10-083.png! 
For users, they can't set comment option to '\u' to keep the behavior as before because the new added `isCommentSet` check logic as follows: {code:java} val isCommentSet = this.comment != '\u' def asWriterSettings: CsvWriterSettings = { // other code if (isCommentSet) { format.setComment(comment) } // other code } {code} It's better to pass the comment option through to univocity if users set it explicitly in CSV dataSource. |id|code| |2.4 and before|3.0 and after|current update|remark| |1|Seq("#abc", "\udef", "xyz").toDF() .write.option("comment", "\u").csv(path)| |#abc *def* xyz|{color:#4c9aff}"#abc"{color} {color:#4c9aff}*def*{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color} {color:#4c9aff}*"def"*{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}a little bit difference{color}| |2|Seq("#abc", "\udef", "xyz").toDF() .write.option("comment", "#").csv(path)| |#abc *def* xyz|"#abc" *def* xyz|"#abc" *def* xyz|the same| |3|Seq("#abc", "\udef", "xyz").toDF() .write.csv(path)| |#abc *def* xyz|"#abc" *def* xyz|"#abc" *def* xyz|the same| |4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path) spark.read.option("comment", "\u").csv(path)| |#abc xyz|{color:#4c9aff}#abc{color} {color:#4c9aff}\udef{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}a liitle bit difference{color}| |5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path) spark.read.option("comment", "#").csv(path)| |\udef xyz|\udef xyz|\udef xyz|the same| |6|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path) spark.read.csv(path)| |#abc xyz|#abc \udef xyz|#abc \udef xyz|the same| > Pass the comment option
[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicitly in CSV dataSource
[ https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-42335: Description: In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it also involved a new feature of univocity-parsers that quoting values of the first column that start with the comment character. It made a breaking for users downstream that handing a whole row as input. For codes: {code:java} Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} Before Spark 3.0,the content of output CSV files is shown as: !image-2023-02-03-18-56-01-596.png! After this change, the content is shown as: !image-2023-02-03-18-56-10-083.png! For users, they can't set comment option to '\u' to keep the behavior as before because the new added `isCommentSet` check logic as follows: {code:java} val isCommentSet = this.comment != '\u' def asWriterSettings: CsvWriterSettings = { // other code if (isCommentSet) { format.setComment(comment) } // other code } {code} It's better to pass the comment option through to univocity if users set it explicitly in CSV dataSource. 
After this change, the behavior as flows: |id|code|2.4 and before|3.0 and after|this update|remark| |1|Seq("#abc", "\udef", "xyz").toDF() .write.option("comment", "\u").csv(path)|#abc *def* xyz|{color:#4c9aff}"#abc"{color} {color:#4c9aff}*def*{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color} {color:#4c9aff}*"def"*{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit difference with 3.0{color}| |2|Seq("#abc", "\udef", "xyz").toDF() .write.option("comment", "#").csv(path)|#abc *def* xyz|"#abc" *def* xyz|"#abc" *def* xyz|the same| |3|Seq("#abc", "\udef", "xyz").toDF() .write.csv(path)|#abc *def* xyz|"#abc" *def* xyz|"#abc" *def* xyz|the same| |4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path) spark.read.option("comment", "\u").csv(path)|#abc xyz|{color:#4c9aff}#abc{color} {color:#4c9aff}\udef{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit difference with 3.0{color}| |5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path) spark.read.option("comment", "#").csv(path)|\udef xyz|\udef xyz|\udef xyz|the same| |6|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path) spark.read.csv(path)|#abc xyz|#abc \udef xyz|#abc \udef xyz|the same| was: In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it also involved a new feature of univocity-parsers that quoting values of the first column that start with the comment character. It made a breaking for users downstream that handing a whole row as input. For codes: {code:java} Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} Before Spark 3.0,the content of output CSV files is shown as: !image-2023-02-03-18-56-01-596.png! After this change, the content is shown as: !image-2023-02-03-18-56-10-083.png! 
For users, they can't set comment option to '\u' to keep the behavior as before because the new added `isCommentSet` check logic as follows: {code:java} val isCommentSet = this.comment != '\u' def asWriterSettings: CsvWriterSettings = { // other code if (isCommentSet) { format.setComment(comment) } // other code } {code} It's better to pass the comment option through to univocity if users set it explicitly in CSV dataSource. |id|code|2.4 and before|3.0 and after|this update|remark| |1|Seq("#abc", "\udef", "xyz").toDF() .write.option("comment", "\u").csv(path)|#abc *def* xyz|{color:#4c9aff}"#abc"{color} {color:#4c9aff}*def*{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color} {color:#4c9aff}*"def"*{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit difference with 3.0{color}| |2|Seq("#abc", "\udef", "xyz").toDF() .write.option("comment", "#").csv(path)|#abc *def* xyz|"#abc" *def* xyz|"#abc" *def* xyz|the same| |3|Seq("#abc", "\udef", "xyz").toDF() .write.csv(path)|#abc *def* xyz|"#abc" *def* xyz|"#abc" *def* xyz|the same| |4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path) spark.read.option("comment", "\u").csv(path)|#abc xyz|{color:#4c9aff}#abc{color} {color:#4c9aff}\udef{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit difference with 3.0{color}| |5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path) spark.read.option("comment", "#").csv(path)|\udef xyz|\udef xyz|\udef xyz|the same| |6|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path) spark.read.csv(path)|#abc xyz|#abc
[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicitly in CSV dataSource
[ https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-42335: Description: In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it also involved a new feature of univocity-parsers that quoting values of the first column that start with the comment character. It made a breaking for users downstream that handing a whole row as input. For codes: {code:java} Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} Before Spark 3.0,the content of output CSV files is shown as: !image-2023-02-03-18-56-01-596.png! After this change, the content is shown as: !image-2023-02-03-18-56-10-083.png! For users, they can't set comment option to '\u' to keep the behavior as before because the new added `isCommentSet` check logic as follows: {code:java} val isCommentSet = this.comment != '\u' def asWriterSettings: CsvWriterSettings = { // other code if (isCommentSet) { format.setComment(comment) } // other code } {code} It's better to pass the comment option through to univocity if users set it explicitly in CSV dataSource. 
|id|code| |2.4 and before|3.0 and after|current update|remark| |1|Seq("#abc", "\udef", "xyz").toDF() .write.option("comment", "\u").csv(path)| |#abc *def* xyz|{color:#4c9aff}"#abc"{color} {color:#4c9aff}*def*{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color} {color:#4c9aff}*"def"*{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}a little bit difference{color}| |2|Seq("#abc", "\udef", "xyz").toDF() .write.option("comment", "#").csv(path)| |#abc *def* xyz|"#abc" *def* xyz|"#abc" *def* xyz|the same| |3|Seq("#abc", "\udef", "xyz").toDF() .write.csv(path)| |#abc *def* xyz|"#abc" *def* xyz|"#abc" *def* xyz|the same| |4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path) spark.read.option("comment", "\u").csv(path)| |#abc xyz|{color:#4c9aff}#abc{color} {color:#4c9aff}\udef{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color} {color:#4c9aff}xyz{color}|{color:#4c9aff}a liitle bit difference{color}| |5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path) spark.read.option("comment", "#").csv(path)| |\udef xyz|\udef xyz|\udef xyz|the same| |6|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path) spark.read.csv(path)| |#abc xyz|#abc \udef xyz|#abc \udef xyz|the same| was: In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it also involved a new feature of univocity-parsers that quoting values of the first column that start with the comment character. It made a breaking for users downstream that handing a whole row as input. For codes: {code:java} Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} Before Spark 3.0,the content of output CSV files is shown as: !image-2023-02-03-18-56-01-596.png! After this change, the content is shown as: !image-2023-02-03-18-56-10-083.png! 
For users, they can't set comment option to '\u' to keep the behavior as before because the new added `isCommentSet` check logic as follows: {code:java} val isCommentSet = this.comment != '\u' def asWriterSettings: CsvWriterSettings = { // other code if (isCommentSet) { format.setComment(comment) } // other code } {code} It's better to pass the comment option through to univocity if users set it explicitly in CSV dataSource. > Pass the comment option through to univocity if users set it explicitly in > CSV dataSource > - > > Key: SPARK-42335 > URL: https://issues.apache.org/jira/browse/SPARK-42335 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0 >Reporter: Wei Guo >Priority: Minor > Fix For: 3.4.0 > > Attachments: image-2023-02-03-18-56-01-596.png, > image-2023-02-03-18-56-10-083.png > > > In PR [https://github.com/apache/spark/pull/29516], in order to fix some > bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to > 2.9.0, it also involved a new feature of univocity-parsers that quoting > values of the first column that start with the comment character. It made a > breaking for users downstream that handing a whole row as input. > > For codes: > {code:java} > Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} > Before Spark 3.0,the content of output CSV files is shown as: > !image-2023-02-03-18-56-01-596.png! > After this change, the content is shown as: > !image-2023-02-03-18-56-10-083.png! > For users, they can't set comment option
[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicitly in CSV dataSource
[ https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Guo updated SPARK-42335:

Summary: Pass the comment option through to univocity if users set it explicitly in CSV dataSource (was: Pass the comment option through to univocity if users set it explicity in CSV dataSource)

> Pass the comment option through to univocity if users set it explicitly in CSV dataSource
> -
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
> Reporter: Wei Guo
> Priority: Minor
> Fix For: 3.4.0
> Attachments: image-2023-02-03-18-56-01-596.png, image-2023-02-03-18-56-10-083.png
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, univocity-parsers, which is used by the CSV dataSource, was upgraded from 2.8.3 to 2.9.0. The upgrade also brought in a new univocity-parsers feature that quotes values of the first column that start with the comment character. This is a breaking change for downstream users that handle a whole row as input.
>
> For example, for the code:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test")
> {code}
> Before Spark 3.0, the content of the output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> Users can't set the comment option to '\u0000' to keep the previous behavior, because of the newly added `isCommentSet` check, whose logic is as follows:
> {code:java}
> val isCommentSet = this.comment != '\u0000'
>
> def asWriterSettings: CsvWriterSettings = {
>   // other code
>   if (isCommentSet) {
>     format.setComment(comment)
>   }
>   // other code
> }
> {code}
> It's better to pass the comment option through to univocity if users set it explicitly in the CSV dataSource.
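The proposal in the description can be sketched roughly as follows. This is a minimal, self-contained illustration and not Spark's actual `CSVOptions` code: `CsvOptionsSketch`, `WriterFormat` and `applyTo` are hypothetical names, `WriterFormat` only stands in for univocity's format object, and the '#' library default is an assumption of the sketch. The idea is to decide whether the comment character was set by checking if the user supplied the option at all, rather than by comparing the value with '\u0000', so that an explicit '\u0000' is still passed through.

```scala
// Hypothetical stand-in for the univocity writer format; not the real API.
final class WriterFormat {
  // Assumed library default comment character for this sketch.
  var comment: Char = '#'
  def setComment(c: Char): Unit = {
    comment = c
  }
}

// Hypothetical, simplified options holder (not Spark's CSVOptions).
final case class CsvOptionsSketch(parameters: Map[String, String]) {
  // Value used when the user does not provide a single-character option.
  val comment: Char = parameters.get("comment") match {
    case Some(v) if v.length == 1 => v.head
    case _ => '\u0000'
  }

  // True only when the user explicitly supplied a single-character value,
  // even if that value is '\u0000'.
  val isCommentSet: Boolean = parameters.get("comment").exists(_.length == 1)

  // Pass the comment character through only when it was set explicitly.
  def applyTo(format: WriterFormat): WriterFormat = {
    if (isCommentSet) format.setComment(comment)
    format
  }
}

object CommentOptionDemo {
  def main(args: Array[String]): Unit = {
    val explicitNul = CsvOptionsSketch(Map("comment" -> "\u0000")).applyTo(new WriterFormat)
    val unset = CsvOptionsSketch(Map.empty).applyTo(new WriterFormat)
    println(explicitNul.comment == '\u0000') // explicit value is passed through
    println(unset.comment == '#')            // library default is left untouched
  }
}
```

With a check like `this.comment != '\u0000'`, the first case could not be distinguished from the second, which is the limitation the description points out.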
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicity in CSV dataSource
[ https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-42335: Description: In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it also involved a new feature of univocity-parsers that quoting values of the first column that start with the comment character. It made a breaking for users downstream that handing a whole row as input. For codes: {code:java} Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} Before Spark 3.0,the content of output CSV files is shown as: !image-2023-02-03-18-56-01-596.png! After this change, the content is shown as: !image-2023-02-03-18-56-10-083.png! For users, they can't set comment option to '\u' to keep the behavior as before because the new added `isCommentSet` check logic as follows: {code:java} val isCommentSet = this.comment != '\u' def asWriterSettings: CsvWriterSettings = { // other code if (isCommentSet) { format.setComment(comment) } // other code } {code} It's better to pass the comment option through to univocity if users set it explicitly in CSV dataSource. was: In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it also involved a new feature of univocity-parsers that quoting values of the first column that start with the comment character. It made a breaking for users downstream that handing a whole row as input. For codes: {code:java} Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} Before Spark 3.0,the content of output CSV files is shown as: !image-2023-02-03-18-56-01-596.png! After this change, the content is shown as: !image-2023-02-03-18-56-10-083.png! 
For users, they can't set comment option to '\u' to keep the behavior as before because the new added `isCommentSet` check logic as follows: {code:java} val isCommentSet = this.comment != '\u' def asWriterSettings: CsvWriterSettings = { // other code if (isCommentSet) { format.setComment(comment) } // other code } {code} It's better to pass the comment option through to Univocity if users set it explicitly in CSV dataSource. > Pass the comment option through to univocity if users set it explicity in CSV > dataSource > > > Key: SPARK-42335 > URL: https://issues.apache.org/jira/browse/SPARK-42335 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0 >Reporter: Wei Guo >Priority: Minor > Fix For: 3.4.0 > > Attachments: image-2023-02-03-18-56-01-596.png, > image-2023-02-03-18-56-10-083.png > > > In PR [https://github.com/apache/spark/pull/29516], in order to fix some > bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to > 2.9.0, it also involved a new feature of univocity-parsers that quoting > values of the first column that start with the comment character. It made a > breaking for users downstream that handing a whole row as input. > > For codes: > {code:java} > Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} > Before Spark 3.0,the content of output CSV files is shown as: > !image-2023-02-03-18-56-01-596.png! > After this change, the content is shown as: > !image-2023-02-03-18-56-10-083.png! > For users, they can't set comment option to '\u' to keep the behavior as > before because the new added `isCommentSet` check logic as follows: > {code:java} > val isCommentSet = this.comment != '\u' > def asWriterSettings: CsvWriterSettings = { > // other code > if (isCommentSet) { > format.setComment(comment) > } > // other code > } > {code} > It's better to pass the comment option through to univocity if users set it > explicitly in CSV dataSource. 
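To make the writer-side behavior concrete, here is a small sketch of the quoting rule described above, under stated assumptions. It does not call univocity-parsers: `writeRow` is a hypothetical helper that only mimics the rule "quote a first-column value that starts with the comment character", and escaping of embedded quotes and delimiters is left out.

```scala
object CommentQuotingSketch {
  // Hypothetical helper: joins values with commas and quotes the first value
  // when it begins with the configured comment character.
  def writeRow(values: Seq[String], comment: Char): String =
    values.zipWithIndex.map {
      case (v, 0) if v.nonEmpty && v.head == comment => "\"" + v + "\""
      case (v, _) => v
    }.mkString(",")

  def main(args: Array[String]): Unit = {
    // With '#' as the comment character, "#abc" is quoted in the first column.
    println(writeRow(Seq("#abc", "1"), '#'))
    // With '\u0000' as the comment character, "#abc" is written as-is.
    println(writeRow(Seq("#abc", "1"), '\u0000'))
  }
}
```

The second call shows why passing an explicit '\u0000' through to the writer would let users keep the unquoted output.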
[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicity in CSV dataSource
[ https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-42335: Description: In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it also involved a new feature of univocity-parsers that quoting values of the first column that start with the comment character. It made a breaking for users downstream that handing a whole row as input. For codes: {code:java} Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} Before Spark 3.0,the content of output CSV files is shown as: !image-2023-02-03-18-56-01-596.png! After this change, the content is shown as: !image-2023-02-03-18-56-10-083.png! For users, they can't set comment option to '\u' to keep the behavior as before because the new added `isCommentSet` check logic as follows: {code:java} val isCommentSet = this.comment != '\u' def asWriterSettings: CsvWriterSettings = { // other code if (isCommentSet) { format.setComment(comment) } // other code } {code} It's better to pass the comment option through to Univocity if users set it explicitly in CSV dataSource. was: In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it also involved a new feature of univocity-parsers that quoting values of the first column that start with the comment character. It made a breaking for users downstream that handing a whole row as input. For codes: {code:java} Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} Before Spark 3.0,the content of output CSV files is shown as: !image-2023-02-03-18-56-01-596.png! After this change, the content is shown as: !image-2023-02-03-18-56-10-083.png! 
For users, they can't set comment option to '\u' to keep the behavior before because the `isCommentSet` check logic as follows: {code:java} val isCommentSet = this.comment != '\u' def asWriterSettings: CsvWriterSettings = { // other code if (isCommentSet) { format.setComment(comment) } // other code } {code} until univocity-parses releases a new version. > Pass the comment option through to univocity if users set it explicity in CSV > dataSource > > > Key: SPARK-42335 > URL: https://issues.apache.org/jira/browse/SPARK-42335 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0 >Reporter: Wei Guo >Priority: Minor > Fix For: 3.4.0 > > Attachments: image-2023-02-03-18-56-01-596.png, > image-2023-02-03-18-56-10-083.png > > > In PR [https://github.com/apache/spark/pull/29516], in order to fix some > bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to > 2.9.0, it also involved a new feature of univocity-parsers that quoting > values of the first column that start with the comment character. It made a > breaking for users downstream that handing a whole row as input. > > For codes: > {code:java} > Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} > Before Spark 3.0,the content of output CSV files is shown as: > !image-2023-02-03-18-56-01-596.png! > After this change, the content is shown as: > !image-2023-02-03-18-56-10-083.png! > For users, they can't set comment option to '\u' to keep the behavior as > before because the new added `isCommentSet` check logic as follows: > {code:java} > val isCommentSet = this.comment != '\u' > def asWriterSettings: CsvWriterSettings = { > // other code > if (isCommentSet) { > format.setComment(comment) > } > // other code > } > {code} > It's better to pass the comment option through to Univocity if users set it > explicitly in CSV dataSource. 
[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicity in CSV dataSource
[ https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-42335: Description: In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it also involved a new feature of univocity-parsers that quoting values of the first column that start with the comment character. It made a breaking for users downstream that handing a whole row as input. For codes: {code:java} Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} Before Spark 3.0,the content of output CSV files is shown as: !image-2023-02-03-18-56-01-596.png! After this change, the content is shown as: !image-2023-02-03-18-56-10-083.png! For users, they can't set comment option to '\u' to keep the behavior before because the `isCommentSet` check logic as follows: {code:java} val isCommentSet = this.comment != '\u' def asWriterSettings: CsvWriterSettings = { // other code if (isCommentSet) { format.setComment(comment) } // other code } {code} until univocity-parses releases a new version. was: In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it also involved a new feature of univocity-parsers that quoting values of the first column that start with the comment character. It made a breaking for users downstream that handing a whole row as input. For codes: {code:java} Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} Before Spark 3.0,the content of output CSV files is shown as: !image-2023-02-03-18-56-01-596.png! After this change, the content is shown as: !image-2023-02-03-18-56-10-083.png! 
For users, they can't set comment option to '\u' to keep the behavior before because the `isCommentSet` check logic as follows: {code:java} val isCommentSet = this.comment != '\u' def asWriterSettings: CsvWriterSettings = { xx if (isCommentSet) { format.setComment(comment) } } {code} until univocity-parses releases a new version. > Pass the comment option through to univocity if users set it explicity in CSV > dataSource > > > Key: SPARK-42335 > URL: https://issues.apache.org/jira/browse/SPARK-42335 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0 >Reporter: Wei Guo >Priority: Minor > Fix For: 3.4.0 > > Attachments: image-2023-02-03-18-56-01-596.png, > image-2023-02-03-18-56-10-083.png > > > In PR [https://github.com/apache/spark/pull/29516], in order to fix some > bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to > 2.9.0, it also involved a new feature of univocity-parsers that quoting > values of the first column that start with the comment character. It made a > breaking for users downstream that handing a whole row as input. > > For codes: > {code:java} > Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} > Before Spark 3.0,the content of output CSV files is shown as: > !image-2023-02-03-18-56-01-596.png! > After this change, the content is shown as: > !image-2023-02-03-18-56-10-083.png! > For users, they can't set comment option to '\u' to keep the behavior > before because the `isCommentSet` check logic as follows: > {code:java} > val isCommentSet = this.comment != '\u' > def asWriterSettings: CsvWriterSettings = { > // other code > if (isCommentSet) { > format.setComment(comment) > } > // other code > } > {code} > until univocity-parses releases a new version. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicity in CSV dataSource
[ https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-42335: Description: In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it also involved a new feature of univocity-parsers that quoting values of the first column that start with the comment character. It made a breaking for users downstream that handing a whole row as input. For codes: {code:java} Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} Before Spark 3.0,the content of output CSV files is shown as: !image-2023-02-03-18-56-01-596.png! After this change, the content is shown as: !image-2023-02-03-18-56-10-083.png! For users, they can't set comment option to '\u' to keep the behavior before because the `isCommentSet` check logic as follows: {code:java} val isCommentSet = this.comment != '\u' def asWriterSettings: CsvWriterSettings = { xx if (isCommentSet) { format.setComment(comment) } } {code} until univocity-parses releases a new version. was: In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it also involved a new feature of univocity-parsers that quoting values of the first column that start with the comment character. It made a breaking for users downstream that handing a whole row as input. For codes: {code:java} Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} Before Spark 3.0,the content of output CSV files is shown as: !image-2023-02-03-18-56-01-596.png! After this change, the content is shown as: !image-2023-02-03-18-56-10-083.png! For Spark, it's better to add a legacy config to restores the legacy behavior until univocity-parses releases a new version. 
> Pass the comment option through to univocity if users set it explicity in CSV > dataSource > > > Key: SPARK-42335 > URL: https://issues.apache.org/jira/browse/SPARK-42335 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0 >Reporter: Wei Guo >Priority: Minor > Fix For: 3.4.0 > > Attachments: image-2023-02-03-18-56-01-596.png, > image-2023-02-03-18-56-10-083.png > > > In PR [https://github.com/apache/spark/pull/29516], in order to fix some > bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to > 2.9.0, it also involved a new feature of univocity-parsers that quoting > values of the first column that start with the comment character. It made a > breaking for users downstream that handing a whole row as input. > > For codes: > {code:java} > Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} > Before Spark 3.0,the content of output CSV files is shown as: > !image-2023-02-03-18-56-01-596.png! > After this change, the content is shown as: > !image-2023-02-03-18-56-10-083.png! > For users, they can't set comment option to '\u' to keep the behavior > before because the `isCommentSet` check logic as follows: > {code:java} > val isCommentSet = this.comment != '\u' > def asWriterSettings: CsvWriterSettings = { > xx > if (isCommentSet) { > format.setComment(comment) > } > } > {code} > until univocity-parses releases a new version. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicity in CSV dataSource
[ https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-42335: Description: In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it also involved a new feature of univocity-parsers that quoting values of the first column that start with the comment character. It made a breaking for users downstream that handing a whole row as input. For codes: {code:java} Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} Before Spark 3.0,the content of output CSV files is shown as: !image-2023-02-03-18-56-01-596.png! After this change, the content is shown as: !image-2023-02-03-18-56-10-083.png! For Spark, it's better to add a legacy config to restores the legacy behavior until univocity-parses releases a new version. was: In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it also involved a new feature of univocity-parsers that quoting values of the first column that start with the comment character. It made a breaking for users downstream that handing a whole row as input. For codes: {code:java} Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} Before Spark 3.0,the content of output CSV files is shown as: !image-2023-02-03-18-56-01-596.png! After this change, the content is shown as: !image-2023-02-03-18-56-10-083.png! I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] for issue [univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505] to univocity-parses, but it seems to be a long time for waiting it to be merged. For Spark, it's better to add a legacy config to restores the legacy behavior until univocity-parses releases a new version. 
> Pass the comment option through to univocity if users set it explicity in CSV > dataSource > > > Key: SPARK-42335 > URL: https://issues.apache.org/jira/browse/SPARK-42335 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0 >Reporter: Wei Guo >Priority: Minor > Fix For: 3.4.0 > > Attachments: image-2023-02-03-18-56-01-596.png, > image-2023-02-03-18-56-10-083.png > > > In PR [https://github.com/apache/spark/pull/29516], in order to fix some > bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to > 2.9.0, it also involved a new feature of univocity-parsers that quoting > values of the first column that start with the comment character. It made a > breaking for users downstream that handing a whole row as input. > > For codes: > {code:java} > Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} > Before Spark 3.0,the content of output CSV files is shown as: > !image-2023-02-03-18-56-01-596.png! > After this change, the content is shown as: > !image-2023-02-03-18-56-10-083.png! > For Spark, it's better to add a legacy config to restores the legacy behavior > until univocity-parses releases a new version. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicity in CSV dataSource
[ https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-42335: Summary: Pass the comment option through to univocity if users set it explicity in CSV dataSource (was: Add a legacy config for restoring writer's comment option behavior in CSV dataSource) > Pass the comment option through to univocity if users set it explicity in CSV > dataSource > > > Key: SPARK-42335 > URL: https://issues.apache.org/jira/browse/SPARK-42335 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0 >Reporter: Wei Guo >Priority: Minor > Fix For: 3.4.0 > > Attachments: image-2023-02-03-18-56-01-596.png, > image-2023-02-03-18-56-10-083.png > > > In PR [https://github.com/apache/spark/pull/29516], in order to fix some > bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to > 2.9.0, it also involved a new feature of univocity-parsers that quoting > values of the first column that start with the comment character. It made a > breaking for users downstream that handing a whole row as input. > > For codes: > {code:java} > Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} > Before Spark 3.0,the content of output CSV files is shown as: > !image-2023-02-03-18-56-01-596.png! > After this change, the content is shown as: > !image-2023-02-03-18-56-10-083.png! > I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] > for issue > [univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505] > to univocity-parses, but it seems to be a long time for waiting it to be > merged. > > For Spark, it's better to add a legacy config to restores the legacy behavior > until univocity-parses releases a new version. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42252) Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config
[ https://issues.apache.org/jira/browse/SPARK-42252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Guo updated SPARK-42252:

Fix Version/s: 3.5.0 (was: 3.4.0)

> Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config
> --
>
> Key: SPARK-42252
> URL: https://issues.apache.org/jira/browse/SPARK-42252
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Wei Guo
> Priority: Minor
> Fix For: 3.5.0
>
> After Jira SPARK-28209 and PR [25007|https://github.com/apache/spark/pull/25007], the new shuffle writer API was introduced. All shuffle writers (BypassMergeSortShuffleWriter, SortShuffleWriter, UnsafeShuffleWriter) are now based on LocalDiskShuffleMapOutputWriter to write local disk shuffle files. The config spark.shuffle.unsafe.file.output.buffer, which is now used in LocalDiskShuffleMapOutputWriter, was previously used only in UnsafeShuffleWriter.
>
> It's better to rename it so that the name matches its current, broader use.
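One common way to handle such a rename is to read the new key first and fall back to the deprecated key with a warning. The sketch below illustrates that idea only. It is not Spark's `SparkConf` implementation, `NewKey` is a placeholder chosen for illustration because the message above does not name the new config, and the `32k` default is likewise an assumption of the sketch.

```scala
object ShuffleBufferConfigSketch {
  val DeprecatedKey = "spark.shuffle.unsafe.file.output.buffer"
  // Placeholder name for the new config; hypothetical, not taken from the issue.
  val NewKey = "spark.shuffle.example.file.output.buffer"
  // Assumed default value for this sketch.
  val DefaultValue = "32k"

  // Prefer the new key, then the deprecated key (with a warning), then the default.
  def outputBuffer(conf: Map[String, String]): String =
    conf.get(NewKey)
      .orElse {
        val old = conf.get(DeprecatedKey)
        if (old.isDefined) {
          Console.err.println(s"Config $DeprecatedKey is deprecated; use $NewKey instead.")
        }
        old
      }
      .getOrElse(DefaultValue)

  def main(args: Array[String]): Unit = {
    println(outputBuffer(Map(DeprecatedKey -> "64k")))
    println(outputBuffer(Map(NewKey -> "128k", DeprecatedKey -> "64k")))
    println(outputBuffer(Map.empty))
  }
}
```

Keeping the deprecated key readable means existing jobs continue to work while users migrate to the new name.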
[jira] [Updated] (SPARK-42252) Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config
[ https://issues.apache.org/jira/browse/SPARK-42252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Guo updated SPARK-42252:

Affects Version/s: 3.3.0 3.2.0 3.1.0 3.4.0

> Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config
> --
>
> Key: SPARK-42252
> URL: https://issues.apache.org/jira/browse/SPARK-42252
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.4.0
> Reporter: Wei Guo
> Priority: Minor
> Fix For: 3.5.0
>
> After Jira SPARK-28209 and PR [25007|https://github.com/apache/spark/pull/25007], the new shuffle writer API was introduced. All shuffle writers (BypassMergeSortShuffleWriter, SortShuffleWriter, UnsafeShuffleWriter) are now based on LocalDiskShuffleMapOutputWriter to write local disk shuffle files. The config spark.shuffle.unsafe.file.output.buffer, which is now used in LocalDiskShuffleMapOutputWriter, was previously used only in UnsafeShuffleWriter.
>
> It's better to rename it so that the name matches its current, broader use.
[jira] [Updated] (SPARK-42252) Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config
[ https://issues.apache.org/jira/browse/SPARK-42252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Guo updated SPARK-42252:

Target Version/s: 3.5.0 (was: 3.4.0)

> Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config
> --
>
> Key: SPARK-42252
> URL: https://issues.apache.org/jira/browse/SPARK-42252
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Wei Guo
> Priority: Minor
> Fix For: 3.4.0
>
> After Jira SPARK-28209 and PR [25007|https://github.com/apache/spark/pull/25007], the new shuffle writer API was introduced. All shuffle writers (BypassMergeSortShuffleWriter, SortShuffleWriter, UnsafeShuffleWriter) are now based on LocalDiskShuffleMapOutputWriter to write local disk shuffle files. The config spark.shuffle.unsafe.file.output.buffer, which is now used in LocalDiskShuffleMapOutputWriter, was previously used only in UnsafeShuffleWriter.
>
> It's better to rename it so that the name matches its current, broader use.
[jira] [Updated] (SPARK-42335) Add a legacy config for restoring writer's comment option behavior in CSV dataSource
[ https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-42335: Attachment: image-2023-02-03-18-56-10-083.png > Add a legacy config for restoring writer's comment option behavior in CSV > dataSource > > > Key: SPARK-42335 > URL: https://issues.apache.org/jira/browse/SPARK-42335 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0 >Reporter: Wei Guo >Priority: Minor > Fix For: 3.4.0 > > Attachments: image-2023-02-03-18-56-01-596.png, > image-2023-02-03-18-56-10-083.png > > > In PR [https://github.com/apache/spark/pull/29516], in order to fix some > bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to > 2.9.0, it also involved a new feature of univocity-parsers that quoting > values of the first column that start with the comment character. It made a > breaking for users downstream that handing a whole row as input. > > For codes: > {code:java} > Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} > Before Spark 3.0,the content of output CSV files is shown as: > !image-2023-02-03-18-44-30-296.png! > After this change, the content is shown as: > !image-2023-02-03-18-15-12-661.png! > I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] > for issue > [univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505] > to univocity-parses, but it seems to be a long time for waiting it to be > merged. > > For Spark, it's better to add a legacy config to restores the legacy behavior > until univocity-parses releases a new version. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42335) Add a legacy config for restoring writer's comment option behavior in CSV dataSource
[ https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-42335: Description: In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it also involved a new feature of univocity-parsers that quoting values of the first column that start with the comment character. It made a breaking for users downstream that handing a whole row as input. For codes: {code:java} Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} Before Spark 3.0,the content of output CSV files is shown as: !image-2023-02-03-18-56-01-596.png! After this change, the content is shown as: !image-2023-02-03-18-56-10-083.png! I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] for issue [univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505] to univocity-parses, but it seems to be a long time for waiting it to be merged. For Spark, it's better to add a legacy config to restores the legacy behavior until univocity-parses releases a new version. was: In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it also involved a new feature of univocity-parsers that quoting values of the first column that start with the comment character. It made a breaking for users downstream that handing a whole row as input. For codes: {code:java} Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} Before Spark 3.0,the content of output CSV files is shown as: !image-2023-02-03-18-44-30-296.png! After this change, the content is shown as: !image-2023-02-03-18-15-12-661.png! 
I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] for issue [univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505] to univocity-parses, but it seems to be a long time for waiting it to be merged. For Spark, it's better to add a legacy config to restores the legacy behavior until univocity-parses releases a new version. > Add a legacy config for restoring writer's comment option behavior in CSV > dataSource > > > Key: SPARK-42335 > URL: https://issues.apache.org/jira/browse/SPARK-42335 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0 >Reporter: Wei Guo >Priority: Minor > Fix For: 3.4.0 > > Attachments: image-2023-02-03-18-56-01-596.png, > image-2023-02-03-18-56-10-083.png > > > In PR [https://github.com/apache/spark/pull/29516], in order to fix some > bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to > 2.9.0, it also involved a new feature of univocity-parsers that quoting > values of the first column that start with the comment character. It made a > breaking for users downstream that handing a whole row as input. > > For codes: > {code:java} > Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} > Before Spark 3.0,the content of output CSV files is shown as: > !image-2023-02-03-18-56-01-596.png! > After this change, the content is shown as: > !image-2023-02-03-18-56-10-083.png! > I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] > for issue > [univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505] > to univocity-parses, but it seems to be a long time for waiting it to be > merged. > > For Spark, it's better to add a legacy config to restores the legacy behavior > until univocity-parses releases a new version. 
[jira] [Updated] (SPARK-42335) Add a legacy config for restoring writer's comment option behavior in CSV dataSource
[ https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-42335: Attachment: image-2023-02-03-18-56-01-596.png > Add a legacy config for restoring writer's comment option behavior in CSV > dataSource > > > Key: SPARK-42335 > URL: https://issues.apache.org/jira/browse/SPARK-42335 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0 >Reporter: Wei Guo >Priority: Minor > Fix For: 3.4.0 > > Attachments: image-2023-02-03-18-56-01-596.png, > image-2023-02-03-18-56-10-083.png > > > In PR [https://github.com/apache/spark/pull/29516], in order to fix some > bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to > 2.9.0, it also involved a new feature of univocity-parsers that quoting > values of the first column that start with the comment character. It made a > breaking for users downstream that handing a whole row as input. > > For codes: > {code:java} > Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} > Before Spark 3.0,the content of output CSV files is shown as: > !image-2023-02-03-18-44-30-296.png! > After this change, the content is shown as: > !image-2023-02-03-18-15-12-661.png! > I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] > for issue > [univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505] > to univocity-parses, but it seems to be a long time for waiting it to be > merged. > > For Spark, it's better to add a legacy config to restores the legacy behavior > until univocity-parses releases a new version. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42335) Add a legacy config for restoring writer's comment option behavior in CSV dataSource
Wei Guo created SPARK-42335: --- Summary: Add a legacy config for restoring writer's comment option behavior in CSV dataSource Key: SPARK-42335 URL: https://issues.apache.org/jira/browse/SPARK-42335 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0, 3.2.0, 3.1.0, 3.0.0 Reporter: Wei Guo Fix For: 3.4.0 In PR [https://github.com/apache/spark/pull/29516], univocity-parsers, which the CSV data source uses, was upgraded from 2.8.3 to 2.9.0 to fix some bugs. The upgrade also brought in a new univocity-parsers feature: values in the first column that start with the comment character are now quoted. This is a breaking change for downstream users who consume a whole row as input. For example: {code:java} Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code} Before Spark 3.0, the content of the output CSV files was: !image-2023-02-03-18-44-30-296.png! After this change, the content is: !image-2023-02-03-18-15-12-661.png! I have opened PR [https://github.com/uniVocity/univocity-parsers/pull/518] against univocity-parsers for issue [univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505], but it may take a long time to be merged. For Spark, it is better to add a legacy config that restores the old behavior until univocity-parsers releases a new version.
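The quoting behavior described in SPARK-42335 can be modeled in a few lines of plain Scala. This is only an illustrative sketch, not univocity-parsers or Spark code: the object name, the `renderRow` helper and the `quoteLeadingComment` flag are all hypothetical, and the exact quoting rules of the real writer are assumed from the ticket text rather than taken from the library.

```scala
// Illustrative model only; none of these names exist in Spark or univocity-parsers.
object CommentQuotingSketch {
  // Renders one CSV row. When `quoteLeadingComment` is true, a first-column
  // value that starts with the `comment` character is wrapped in quotes
  // (the newer behavior described in the ticket). When false, the value is
  // written unchanged (the legacy behavior the ticket wants to restore).
  def renderRow(fields: Seq[String], comment: Char, quoteLeadingComment: Boolean): String = {
    val rendered = fields.zipWithIndex.map { case (value, idx) =>
      val startsWithComment = value.nonEmpty && value.charAt(0) == comment
      if (idx == 0 && quoteLeadingComment && startsWithComment) "\"" + value + "\"" else value
    }
    rendered.mkString(",")
  }
}
```

With the flag off, a consumer that reads the whole line as a single value sees the text exactly as it was written; with the flag on, it also sees the added quotes, which is the downstream breakage the ticket reports.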
[jira] [Updated] (SPARK-42237) change binary to unsupported dataType in csv format
[ https://issues.apache.org/jira/browse/SPARK-42237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-42237: Description: When a binary colunm is written into csv files, actual content of this colunm is {*}object.toString(){*}, which is meaningless. {code:java} val df = Seq(Array[Byte](1,2)).toDF df.write.csv("/Users/guowei/Desktop/binary_csv") {code} The csv file's content is as follows: !image-2023-01-30-17-21-09-212.png|width=141,height=29! Meanwhile, if a binary colunm saved as table with csv fileformat, the table can't be read back successfully. {code:java} val df = Seq((1, Array[Byte](1,2))).toDF df.write.format("csv").saveAsTable("binaryDataTable")spark.sql("select * from binaryDataTable").show() {code} !https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A! So I think it' better to change binary to unsupported dataType in csv format, both for datasource v1(CSVFileFormat) and v2(CSVTable). was: When a binary colunm is written into csv files, actual content of this colunm is {*}object.toString(){*}, which is meaningless. {code:java} val df = Seq(Array[Byte](1,2)).toDF df.write.csv("/Users/guowei19/Desktop/binary_csv") {code} The csv file's content is as follows: !image-2023-01-30-17-21-09-212.png|width=141,height=29! Meanwhile, if a binary colunm saved as table with csv fileformat, the table can't be read back successfully. 
{code:java} val df = Seq((1, Array[Byte](1,2))).toDF df.write.format("csv").saveAsTable("binaryDataTable")spark.sql("select * from binaryDataTable").show() {code} !https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A! So I think it' better to change binary to unsupported dataType in csv format, both for datasource v1(CSVFileFormat) and v2(CSVTable). > change binary to unsupported dataType in csv format > --- > > Key: SPARK-42237 > URL: https://issues.apache.org/jira/browse/SPARK-42237 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.8, 3.3.1 >Reporter: Wei Guo >Priority: Minor > Attachments: image-2023-01-30-17-21-09-212.png > > > When a binary colunm is written into csv files, actual content of this colunm > is {*}object.toString(){*}, which is meaningless. > {code:java} > val df = Seq(Array[Byte](1,2)).toDF > df.write.csv("/Users/guowei/Desktop/binary_csv") > {code} > The csv file's content is as follows: > !image-2023-01-30-17-21-09-212.png|width=141,height=29! > Meanwhile, if a binary colunm saved as table with csv fileformat, the table > can't be read back successfully. 
> {code:java} > val df = Seq((1, Array[Byte](1,2))).toDF > df.write.format("csv").saveAsTable("binaryDataTable")spark.sql("select * from > binaryDataTable").show() > {code} > !https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A! > So I think it' better to change binary to unsupported dataType in csv format, > both for datasource v1(CSVFileFormat) and v2(CSVTable). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42252) Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config
Wei Guo created SPARK-42252: --- Summary: Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config Key: SPARK-42252 URL: https://issues.apache.org/jira/browse/SPARK-42252 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Wei Guo Fix For: 3.4.0 SPARK-28209 and PR [25007|https://github.com/apache/spark/pull/25007] introduced the new shuffle writer API. Since then, all shuffle writers (BypassMergeSortShuffleWriter, SortShuffleWriter, UnsafeShuffleWriter) write local disk shuffle files through LocalDiskShuffleMapOutputWriter. The config spark.shuffle.unsafe.file.output.buffer, which LocalDiskShuffleMapOutputWriter now reads, was previously used only by UnsafeShuffleWriter. It would be better to rename it so that the name matches its wider scope.
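One common way to deprecate a config key without breaking existing jobs is to read the new key first and fall back to the old one. The sketch below shows that lookup over a plain `Map`. It is an assumption about how such a rename could behave, not Spark's actual implementation. The replacement key name is invented for this example because the ticket does not state one, and the default is passed in by the caller rather than asserted here.

```scala
// Hypothetical sketch of a deprecated-key fallback; not Spark's config code.
object DeprecatedConfigSketch {
  // The existing key named in the ticket.
  val DeprecatedKey = "spark.shuffle.unsafe.file.output.buffer"
  // Invented replacement key, used only for illustration.
  val NewKey = "spark.shuffle.example.file.output.buffer"

  // Prefers the new key, then the deprecated key, then the caller's default.
  def resolve(conf: Map[String, String], default: String): String =
    conf.get(NewKey).orElse(conf.get(DeprecatedKey)).getOrElse(default)
}
```

Jobs that still set only the old key keep their value, while jobs that set both get the new key's value.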
[jira] [Commented] (SPARK-42237) change binary to unsupported dataType in csv format
[ https://issues.apache.org/jira/browse/SPARK-42237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17681978#comment-17681978 ] Wei Guo commented on SPARK-42237: - a pr is ready~ > change binary to unsupported dataType in csv format > --- > > Key: SPARK-42237 > URL: https://issues.apache.org/jira/browse/SPARK-42237 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.8, 3.3.1 >Reporter: Wei Guo >Priority: Minor > Fix For: 3.4.0 > > Attachments: image-2023-01-30-17-21-09-212.png > > > When a binary colunm is written into csv files, actual content of this colunm > is {*}object.toString(){*}, which is meaningless. > {code:java} > val df = > Seq(Array[Byte](1,2)).toDFdf.write.csv("/Users/guowei19/Desktop/binary_csv") > {code} > The csv file's content is as follows: > !image-2023-01-30-17-21-09-212.png|width=141,height=29! > Meanwhile, if a binary colunm saved as table with csv fileformat, the table > can't be read back successfully. > {code:java} > val df = Seq((1, > Array[Byte](1,2))).toDFdf.write.format("csv").saveAsTable("binaryDataTable")spark.sql("select > * from binaryDataTable").show() {code} > !https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A! > So I think it' better to change binary to unsupported dataType in csv format, > both for datasource v1(CSVFileFormat) and v2(CSVTable). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-42237) change binary to unsupported dataType in csv format
[ https://issues.apache.org/jira/browse/SPARK-42237 ] Wei Guo deleted comment on SPARK-42237: - was (Author: wayne guo): a pr is ready~ > change binary to unsupported dataType in csv format > --- > > Key: SPARK-42237 > URL: https://issues.apache.org/jira/browse/SPARK-42237 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.8, 3.3.1 >Reporter: Wei Guo >Priority: Minor > Fix For: 3.4.0 > > Attachments: image-2023-01-30-17-21-09-212.png > > > When a binary colunm is written into csv files, actual content of this colunm > is {*}object.toString(){*}, which is meaningless. > {code:java} > val df = > Seq(Array[Byte](1,2)).toDFdf.write.csv("/Users/guowei19/Desktop/binary_csv") > {code} > The csv file's content is as follows: > !image-2023-01-30-17-21-09-212.png|width=141,height=29! > Meanwhile, if a binary colunm saved as table with csv fileformat, the table > can't be read back successfully. > {code:java} > val df = Seq((1, > Array[Byte](1,2))).toDFdf.write.format("csv").saveAsTable("binaryDataTable")spark.sql("select > * from binaryDataTable").show() {code} > !https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A! > So I think it' better to change binary to unsupported dataType in csv format, > both for datasource v1(CSVFileFormat) and v2(CSVTable). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42237) change binary to unsupported dataType in csv format
[ https://issues.apache.org/jira/browse/SPARK-42237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-42237: Description: When a binary colunm is written into csv files, actual content of this colunm is {*}object.toString(){*}, which is meaningless. {code:java} val df = Seq(Array[Byte](1,2)).toDFdf.write.csv("/Users/guowei19/Desktop/binary_csv") {code} The csv file's content is as follows: !image-2023-01-30-17-21-09-212.png|width=141,height=29! Meanwhile, if a binary colunm saved as table with csv fileformat, the table can't be read back successfully. {code:java} val df = Seq((1, Array[Byte](1,2))).toDFdf.write.format("csv").saveAsTable("binaryDataTable")spark.sql("select * from binaryDataTable").show() {code} !https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A! So I think it' better to change binary to unsupported dataType in csv format, both for datasource v1(CSVFileFormat) and v2(CSVTable). was: When a binary colunm is written into csv files, actual content of this colunm is {*}object.toString(){*}, which is meaningless. {code:java} val df = Seq(Array[Byte](1,2)).toDFdf.write.csv("/Users/guowei19/Desktop/binary_csv") {code} The csv file's content is as follows: !image-2023-01-30-17-18-16-372.png|width=104,height=21! Meanwhile, if a binary colunm saved as table with csv fileformat, the table can't be read back successfully. 
{code:java} val df = Seq((1, Array[Byte](1,2))).toDFdf.write.format("csv").saveAsTable("binaryDataTable")spark.sql("select * from binaryDataTable").show() {code} !https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A! So I think it' better to change binary to unsupported dataType in csv format, both for datasource v1(CSVFileFormat) and v2(CSVTable). > change binary to unsupported dataType in csv format > --- > > Key: SPARK-42237 > URL: https://issues.apache.org/jira/browse/SPARK-42237 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.8, 3.3.1 >Reporter: Wei Guo >Priority: Minor > Fix For: 3.4.0 > > Attachments: image-2023-01-30-17-21-09-212.png > > > When a binary colunm is written into csv files, actual content of this colunm > is {*}object.toString(){*}, which is meaningless. > {code:java} > val df = > Seq(Array[Byte](1,2)).toDFdf.write.csv("/Users/guowei19/Desktop/binary_csv") > {code} > The csv file's content is as follows: > !image-2023-01-30-17-21-09-212.png|width=141,height=29! > Meanwhile, if a binary colunm saved as table with csv fileformat, the table > can't be read back successfully. 
> {code:java} > val df = Seq((1, > Array[Byte](1,2))).toDFdf.write.format("csv").saveAsTable("binaryDataTable")spark.sql("select > * from binaryDataTable").show() {code} > !https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A! > So I think it' better to change binary to unsupported dataType in csv format, > both for datasource v1(CSVFileFormat) and v2(CSVTable). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42237) change binary to unsupported dataType in csv format
[ https://issues.apache.org/jira/browse/SPARK-42237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-42237: Attachment: image-2023-01-30-17-21-09-212.png > change binary to unsupported dataType in csv format > --- > > Key: SPARK-42237 > URL: https://issues.apache.org/jira/browse/SPARK-42237 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.8, 3.3.1 >Reporter: Wei Guo >Priority: Minor > Fix For: 3.4.0 > > Attachments: image-2023-01-30-17-21-09-212.png > > > When a binary colunm is written into csv files, actual content of this colunm > is {*}object.toString(){*}, which is meaningless. > {code:java} > val df = > Seq(Array[Byte](1,2)).toDFdf.write.csv("/Users/guowei19/Desktop/binary_csv") > {code} > The csv file's content is as follows: > !image-2023-01-30-17-18-16-372.png|width=104,height=21! > Meanwhile, if a binary colunm saved as table with csv fileformat, the table > can't be read back successfully. > {code:java} > val df = Seq((1, > Array[Byte](1,2))).toDFdf.write.format("csv").saveAsTable("binaryDataTable")spark.sql("select > * from binaryDataTable").show() {code} > !https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A! > So I think it' better to change binary to unsupported dataType in csv format, > both for datasource v1(CSVFileFormat) and v2(CSVTable). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42237) change binary to unsupported dataType in csv format
Wei Guo created SPARK-42237: --- Summary: change binary to unsupported dataType in csv format Key: SPARK-42237 URL: https://issues.apache.org/jira/browse/SPARK-42237 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.1, 2.4.8 Reporter: Wei Guo Fix For: 3.4.0 When a binary column is written to CSV files, the actual content of the column is {*}object.toString(){*}, which is meaningless. {code:java} val df = Seq(Array[Byte](1,2)).toDF; df.write.csv("/Users/guowei19/Desktop/binary_csv") {code} The CSV file's content is as follows: !image-2023-01-30-17-18-16-372.png|width=104,height=21! Meanwhile, if a binary column is saved as a table with the CSV file format, the table can't be read back successfully. {code:java} val df = Seq((1, Array[Byte](1,2))).toDF; df.write.format("csv").saveAsTable("binaryDataTable"); spark.sql("select * from binaryDataTable").show() {code} !https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A! So I think it's better to change binary to an unsupported dataType in the CSV format, for both datasource v1 (CSVFileFormat) and v2 (CSVTable).
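Why the written value is meaningless can be seen without Spark. On the JVM, `Array[Byte]` inherits `Object.toString`, so the text is a type tag plus an identity hash rather than the bytes. The sketch below shows this in plain Scala; the object and method names are hypothetical, and `contentText` is included only for contrast, not as a format proposed in the ticket.

```scala
// Plain-Scala illustration; these names are not part of Spark.
object BinaryCellSketch {
  // Array[Byte] does not override toString, so this returns something like
  // "[B@1b2c3d": the array type tag followed by an identity hash.
  def identityText(bytes: Array[Byte]): String = bytes.toString

  // For contrast only: a rendering that actually depends on the content.
  def contentText(bytes: Array[Byte]): String = bytes.mkString("[", ",", "]")
}
```

Because the identity text carries none of the original bytes, a reader of the CSV file has nothing to decode, which fits the ticket's argument for rejecting the type up front.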
[jira] [Commented] (SPARK-39901) Reconsider design of ignoreCorruptFiles feature
[ https://issues.apache.org/jira/browse/SPARK-39901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572502#comment-17572502 ] Wei Guo commented on SPARK-39901: - The `ignoreCorruptFiles` feature exists in both the SQL (spark.sql.files.ignoreCorruptFiles) and RDD (spark.files.ignoreCorruptFiles) scenarios, and both need to be covered. > Reconsider design of ignoreCorruptFiles feature > --- > > Key: SPARK-39901 > URL: https://issues.apache.org/jira/browse/SPARK-39901 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Priority: Major > > I'm filing this ticket as a followup to the discussion at > [https://github.com/apache/spark/pull/36775#issuecomment-1148136217] > regarding the `ignoreCorruptFiles` feature: the current implementation is > biased towards considering a broad range of IOExceptions to be corruption, but > this is likely overly-broad and might mis-identify transient errors as > corruption (causing non-corrupt data to be erroneously discarded). > SPARK-39389 fixes one instance of that problem, but we are still vulnerable > to similar issues because of the overall design of this feature. > I think we should reconsider the design of this feature: maybe we should > switch the default behavior so that only an explicit allowlist of known > corruption exceptions can cause files to be skipped. This could be done > through involvement of other parts of the code, e.g. rewrapping exceptions > into a `CorruptFileException` so higher layers can positively identify > corruption. > Any changes to behavior here could potentially impact users' jobs, so we'd > need to think carefully about when we want to change (in a 3.x release? 4.x?) > and how we want to provide escape hatches (e.g. configs to revert back to old > behavior).
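The allowlist idea from SPARK-39901 can be sketched as follows. This is a minimal model under stated assumptions, not Spark code: `CorruptFileException` is defined locally here because the ticket only suggests such a wrapper, and `markCorrupt` and `shouldSkipFile` are hypothetical helper names.

```scala
import java.io.IOException

// Hypothetical sketch of the allowlist design discussed in the ticket.
object CorruptionAllowlistSketch {
  // Local stand-in for the wrapper type the ticket suggests.
  final class CorruptFileException(message: String, cause: Throwable)
    extends IOException(message, cause)

  // Lower layers that positively detect corruption rewrap the original error.
  def markCorrupt(path: String, cause: Throwable): CorruptFileException =
    new CorruptFileException("Corrupt file: " + path, cause)

  // Higher layers skip a file only for allowlisted exceptions, so a plain
  // IOException (for example a transient network error) is not treated
  // as corruption and still fails the task.
  def shouldSkipFile(error: Throwable, ignoreCorruptFiles: Boolean): Boolean =
    ignoreCorruptFiles && (error match {
      case _: CorruptFileException => true
      case _ => false
    })
}
```

The design choice is that corruption has to be asserted by the layer that detected it, instead of being inferred from the exception's broad type.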
[jira] [Updated] (SPARK-37575) null values should be saved as nothing rather than quoted empty Strings "" with default settings
[ https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37575: Description: As mentioned in sql migration guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]), {noformat} Since Spark 2.4, empty strings are saved as quoted empty strings "". In version 2.3 and earlier, empty strings are equal to null values and do not reflect to any characters in saved CSV files. For example, the row of "a", null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as a,,"",1. To restore the previous behavior, set the CSV option emptyValue to empty (not quoted) string.{noformat} But actually, both empty strings and null values are saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)。 code: {code:java} val data = List("spark", null, "").toDF("name") data.coalesce(1).write.csv("spark_csv_test") {code} actual result: {noformat} line1: spark line2: "" line3: ""{noformat} expected result: {noformat} line1: spark line2: line3: "" {noformat} was: As mentioned in sql migration guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]), {noformat} Since Spark 2.4, empty strings are saved as quoted empty strings "". In version 2.3 and earlier, empty strings are equal to null values and do not reflect to any characters in saved CSV files. For example, the row of "a", null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as a,,"",1. 
To restore the previous behavior, set the CSV option emptyValue to empty (not quoted) string.{noformat} But actually, both empty strings and null values are saved as quoted empty Strings "" rather than "" (for empty strings) and nothing(for null values)。 code: {code:java} val data = List("spark", null, "").toDF("name") data.coalesce(1).write.csv("spark_csv_test") {code} actual result: {noformat} line1: spark line2: "" line3: ""{noformat} expected result: {noformat} line1: spark line2: line3: "" {noformat} > null values should be saved as nothing rather than quoted empty Strings "" > with default settings > > > Key: SPARK-37575 > URL: https://issues.apache.org/jira/browse/SPARK-37575 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.2.0 >Reporter: Wei Guo >Assignee: Wei Guo >Priority: Major > Fix For: 3.3.0 > > > As mentioned in sql migration > guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]), > {noformat} > Since Spark 2.4, empty strings are saved as quoted empty strings "". In > version 2.3 and earlier, empty strings are equal to null values and do not > reflect to any characters in saved CSV files. For example, the row of "a", > null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as > a,,"",1. 
To restore the previous behavior, set the CSV option emptyValue to > empty (not quoted) string.{noformat} > But actually, both empty strings and null values are saved as quoted empty > Strings "" rather than "" (for empty strings) and nothing(for null values)。 > code: > {code:java} > val data = List("spark", null, "").toDF("name") > data.coalesce(1).write.csv("spark_csv_test") > {code} > actual result: > {noformat} > line1: spark > line2: "" > line3: ""{noformat} > expected result: > {noformat} > line1: spark > line2: > line3: "" > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
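The expected result in SPARK-37575 amounts to a three-way distinction when a string field is written. The sketch below models it in plain Scala, with `Option[String]` standing in for a nullable column. The `renderField` helper is hypothetical and is not the Spark or univocity writer. The `nullValue` and `emptyValue` parameters mirror the CSV option names, with values chosen here to match the migration guide text.

```scala
// Illustrative model of the expected write behavior; not Spark's CSV writer.
object NullVsEmptySketch {
  // None models a null column value, Some("") an empty string.
  def renderField(value: Option[String], nullValue: String, emptyValue: String): String =
    value match {
      case None => nullValue       // null: written as nullValue
      case Some("") => emptyValue  // empty string: written as emptyValue
      case Some(text) => text      // anything else: written unchanged
    }
}
```

With `nullValue` set to an empty string and `emptyValue` set to a quoted empty string, the three rows from the ticket come out as `spark`, an empty line, and `""`, which matches the expected result rather than the actual one.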
[jira] [Resolved] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo resolved SPARK-37604. - Resolution: Not A Problem > Change emptyValueInRead's effect to that any fields matching this string will > be set as "" when reading csv files > - > > Key: SPARK-37604 > URL: https://issues.apache.org/jira/browse/SPARK-37604 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.2.0 >Reporter: Wei Guo >Priority: Major > Attachments: empty_test.png > > > The csv data format is imported from databricks > [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with > PR [10766|https://github.com/apache/spark/pull/10766] . > {*}For the nullValue option{*}, according to features described in spark-csv > readme file, it's designed as: > {noformat} > When reading files: > nullValue: specifies a string that indicates a null value, any fields > matching this string will be set as nulls in the DataFrame > When writing files: > nullValue: specifies a string that indicates a null value, nulls in the > DataFrame will be written as this string. > {noformat} > For example, when writing: > {code:scala} > Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", > "NULL").csv(path){code} > The saved csv file is shown as: > {noformat} > Tesla,NULL > {noformat} > When reading: > {code:scala} > spark.read.option("nullValue", "NULL").csv(path).show() > {code} > The parsed dataframe is shown as: > ||make||comment|| > |Tesla|null| > We can find that null columns in dataframe can be saved as "NULL" strings in > csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as > null columns*{color} in dataframe. 
That is: > {noformat} > When writing, convert null(in dataframe) to nullValue(in csv) > When reading, convert nullValue or nothing(in csv) to null(in dataframe) > {noformat} > But actually, the option nullValue in depended component univocity's > {*}_CommonSettings_{*}, is designed as that: > {noformat} > when reading, if the parser does not read any character from the input, the > nullValue is used instead of an empty string. > when writing, if the writer has a null object to write to the output, the > nullValue is used instead of an empty string.{noformat} > {*}There is a difference when reading{*}. In univocity, nothing content will > be convert to nullValue strings. But In Spark, we finally convert nothing > content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* > method: > {code:java} > private def nullSafeDatum( > datum: String, > name: String, > nullable: Boolean, > options: CSVOptions)(converter: ValueConverter): Any = { > if (datum == options.nullValue || datum == null) { > if (!nullable) { > throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name) > } > null > } else { > converter.apply(datum) > } > } {code} > > From now, we start to talk about emptyValue. > {*}For the emptyValue option{*}, we add a emptyValueInRead option for > reading and a emptyValueInWrite option for writing. I found that Spark keeps > the same behaviors for emptyValue with univocity, that is: > {noformat} > When reading, if the parser does not read any character from the input, and > the input is within quotes, the empty is used instead of an empty string. 
> When writing, if the writer has an empty String to write to the output, the > emptyValue is used instead of an empty string.{noformat} > For example, when writing: > {code:scala} > Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", > "EMPTY").csv(path){code} > The saved csv file is shown as: > {noformat} > Tesla,EMPTY {noformat} > When reading: > {code:scala} > spark.read.option("emptyValue", "EMPTY").csv(path).show() > {code} > The parsed dataframe is shown as: > ||make||comment|| > |Tesla|EMPTY| > We can find that empty columns in dataframe can be saved as "EMPTY" strings > in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be > parsed as empty columns{color}* in dataframe. That is: > {noformat} > When writing, convert "" empty(in dataframe) to emptyValue(in csv) > When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe) > {noformat} > > There is an obvious difference between nullValue and emptyValue in read > handling. For nullValue, we will convert nothing or nullValue strings to null > in dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty > strings) to emptyValue strings rather than to convert both "\"\""(quoted > empty strings) and emptyValue strings to ""(empty) in dataframe. > I think it's better that if we
[jira] [Commented] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461525#comment-17461525 ]

Wei Guo commented on SPARK-37604:

Well, I think your explanation is clear and reasonable, and it convinced me. So I'll close this issue and the related PR. Thank you!

> Change emptyValueInRead's effect to that any fields matching this string will
> be set as "" when reading csv files
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0, 3.2.0
> Reporter: Wei Guo
> Priority: Major
> Attachments: empty_test.png
>
> The CSV data format was imported from Databricks
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with
> PR [10766|https://github.com/apache/spark/pull/10766].
> {*}For the nullValue option{*}, according to the features described in the
> spark-csv README, it is designed as follows:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue",
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can see that null columns in the dataframe are saved as "NULL" strings in
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as
> null columns*{color} in the dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But in fact, the nullValue option in the underlying univocity component's
> {*}_CommonSettings_{*} is designed as follows:
> {noformat}
> when reading, if the parser does not read any character from the input, the
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, missing content is
> converted to the nullValue string. But in Spark, we finally convert both missing
> content and nullValue strings to null in the *_UnivocityParser_ _nullSafeDatum_*
> method:
> {code:scala}
> private def nullSafeDatum(
>     datum: String,
>     name: String,
>     nullable: Boolean,
>     options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
>     if (!nullable) {
>       throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
>     }
>     null
>   } else {
>     converter.apply(datum)
>   }
> }
> {code}
>
> Now let's turn to emptyValue.
> {*}For the emptyValue option{*}, we added an emptyValueInRead option for
> reading and an emptyValueInWrite option for writing. I found that Spark keeps
> the same behavior for emptyValue as univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and
> the input is within quotes, the empty is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue",
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can see that empty columns in the dataframe are saved as "EMPTY" strings
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files cannot be
> parsed as empty columns{color}* in the dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>
> There is an obvious difference between nullValue and emptyValue in read
> handling. For nullValue, we convert missing content or nullValue strings to null
> in the dataframe, but for emptyValue, we only convert "\"\"" (quoted empty
> strings) to emptyValue
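The write and read rules quoted in the issue description can be modelled without Spark. The following is only a sketch of the behaviour as the description states it, not Spark's or univocity's implementation: `CsvValueModel`, `Token`, `write`, `tokenize`, `read` and `roundTrip` are hypothetical names, and the toy tokenizer only distinguishes a missing field, a quoted empty field and any other text.

```scala
// Simplified, Spark-free model of the nullValue / emptyValue handling
// described in the issue. All names are illustrative assumptions.
object CsvValueModel {
  // What the tokenizer saw for a single field.
  sealed trait Token
  case object Missing extends Token                  // no characters, unquoted
  case object QuotedEmpty extends Token              // "" in the file
  final case class Text(value: String) extends Token // anything else

  // Write side: null -> nullValue, "" -> emptyValue, other values unchanged.
  def write(value: String, nullValue: String, emptyValue: String): String =
    if (value == null) nullValue
    else if (value.isEmpty) emptyValue
    else value

  // Toy tokenizer for one already-split field.
  def tokenize(field: String): Token =
    if (field.isEmpty) Missing
    else if (field == "\"\"") QuotedEmpty
    else Text(field)

  // Read side in two stages:
  // 1. the parser substitutes nullValue / emptyValue for missing / quoted-empty input;
  // 2. a nullSafeDatum-like step maps null or the nullValue string to null.
  def read(token: Token, nullValue: String, emptyValue: String): String = {
    val parsed = token match {
      case Missing     => nullValue
      case QuotedEmpty => emptyValue
      case Text(v)     => v
    }
    if (parsed == null || parsed == nullValue) null else parsed
  }

  // Write a value and read it back with the same options.
  def roundTrip(value: String, nullValue: String, emptyValue: String): String =
    read(tokenize(write(value, nullValue, emptyValue)), nullValue, emptyValue)

  def main(args: Array[String]): Unit = {
    println(roundTrip(null, "NULL", "EMPTY")) // null survives the round trip
    println(roundTrip("", "NULL", "EMPTY"))   // "" comes back as the string EMPTY
  }
}
```

In this model the null value comes back as null, while the empty string comes back as the literal `EMPTY`, which is the asymmetry the issue describes.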
[jira] [Commented] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461393#comment-17461393 ]

Wei Guo commented on SPARK-37604:

Following Hyukjin Kwon's consideration in the related PR, if we are worried about making a breaking change, we can add a new option to support this behavior.

> Change emptyValueInRead's effect to that any fields matching this string will
> be set as "" when reading csv files
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0, 3.2.0
> Reporter: Wei Guo
> Priority: Major
> Attachments: empty_test.png
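The opt-in idea from the comment above (a new option instead of a breaking change) could look roughly like the sketch below. Everything in it is an assumption made for illustration: `ReadOptions`, `matchEmptyValue` and `readDatum` are hypothetical names and are not existing Spark options or methods. The flag defaults to off, so the current read behaviour is unchanged unless a user asks for the new matching.

```scala
// Hypothetical sketch of an opt-in switch for matching emptyValue on read.
// None of these names exist in Spark; they only illustrate the suggestion.
object EmptyValueReadSketch {
  final case class ReadOptions(
      nullValue: String = "",
      emptyValue: String = "",
      matchEmptyValue: Boolean = false) // assumed new flag, off by default

  // nullSafeDatum-like step applied to an already-parsed field.
  def readDatum(datum: String, options: ReadOptions): String =
    if (datum == null || datum == options.nullValue) null
    else if (options.matchEmptyValue && datum == options.emptyValue) ""
    else datum

  def main(args: Array[String]): Unit = {
    val legacy = ReadOptions(nullValue = "NULL", emptyValue = "EMPTY")
    val optIn = legacy.copy(matchEmptyValue = true)
    println(readDatum("EMPTY", legacy) == "EMPTY") // flag off: string kept as-is
    println(readDatum("EMPTY", optIn).isEmpty)     // flag on: mapped to ""
  }
}
```

Keeping the check in the same place as the nullValue comparison makes the two options symmetric when the flag is on, and leaves existing jobs untouched when it is off.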
[jira] [Commented] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461390#comment-17461390 ]

Wei Guo commented on SPARK-37604:

In short, for null values, we can save null values in a dataframe as "NULL" strings in csv files, and read "NULL" strings back as null values with the same nullValue option ("NULL"). But for empty values, if we save empty values in a dataframe as "EMPTY" strings in csv files, we cannot read "EMPTY" strings back as empty values with the same emptyValue option ("EMPTY"); we end up with "EMPTY" strings. [~maxgekk]

> Change emptyValueInRead's effect to that any fields matching this string will
> be set as "" when reading csv files
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0, 3.2.0
> Reporter: Wei Guo
> Priority: Major
> Attachments: empty_test.png
[jira] [Updated] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Guo updated SPARK-37604:
Issue Type: Improvement (was: Bug)

> Change emptyValueInRead's effect to that any fields matching this string will
> be set as "" when reading csv files
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0, 3.2.0
> Reporter: Wei Guo
> Priority: Major
> Attachments: empty_test.png
[jira] [Comment Edited] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460144#comment-17460144 ]

Wei Guo edited comment on SPARK-37604 at 12/15/21, 6:05 PM:

For this code:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back into a dataframe:
{code:scala}
spark.read.option("emptyValue", "EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!empty_test.png|width=701,height=286!

For null values, we can write and read back with the same nullValue option, but for empty strings, even with the same emptyValue option, it's irreversible.

FYI. [~maxgekk]

> Change emptyValueInRead's effect to that any fields matching this string will
> be set as "" when reading csv files
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0, 3.2.0
> Reporter: Wei Guo
> Priority: Major
> Attachments: empty_test.png
[jira] [Comment Edited] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460144#comment-17460144 ]

Wei Guo edited comment on SPARK-37604 at 12/15/21, 6:05 PM:

For this code:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back into a dataframe:
{code:scala}
spark.read.option("emptyValue", "EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!empty_test.png!

For null values, we can write and read back with the same nullValue option, but for empty strings, even with the same emptyValue option, it's irreversible.

FYI. [~maxgekk]

> Change emptyValueInRead's effect to that any fields matching this string will
> be set as "" when reading csv files
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0, 3.2.0
> Reporter: Wei Guo
> Priority: Major
> Attachments: empty_test.png
[jira] [Updated] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Guo updated SPARK-37604:
    Attachment: (was: image-2021-12-16-01-57-55-864.png)

> Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0, 3.2.0
> Reporter: Wei Guo
> Priority: Major
> Attachments: empty_test.png
>
> The csv data source was imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766].
>
> {*}For the nullValue option{*}, the spark-csv readme file describes the intended behavior as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", "NULL").csv(path)
> {code}
> The saved csv file contains:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> So null columns in a dataframe are saved as "NULL" strings in csv files, and {color:#00875a}*"NULL" strings in csv files are parsed as null columns*{color} in the dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> However, in univocity, the parser library Spark depends on, the nullValue option in {*}_CommonSettings_{*} is defined as:
> {noformat}
> when reading, if the parser does not read any character from the input, the nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the nullValue is used instead of an empty string.
> {noformat}
> {*}There is a difference when reading{*}. In univocity, empty content (no characters read) is converted to the nullValue string. In Spark, both empty content and nullValue strings are finally converted to null in the *_UnivocityParser_ _nullSafeDatum_* method:
> {code:scala}
> private def nullSafeDatum(
>     datum: String,
>     name: String,
>     nullable: Boolean,
>     options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
>     if (!nullable) {
>       throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
>     }
>     null
>   } else {
>     converter.apply(datum)
>   }
> }
> {code}
>
> Now for emptyValue.
> {*}For the emptyValue option{*}, there is an emptyValueInRead option for reading and an emptyValueInWrite option for writing. I found that Spark keeps the same behavior for emptyValue as univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and the input is within quotes, the emptyValue is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the emptyValue is used instead of an empty string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path)
> {code}
> The saved csv file contains:
> {noformat}
> Tesla,EMPTY
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> So empty columns in a dataframe are saved as "EMPTY" strings in csv files, *{color:#de350b}but "EMPTY" strings in csv files cannot be parsed as empty columns{color}* in the dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>
> There is an obvious difference between nullValue and emptyValue in read handling. For nullValue, both empty content and nullValue strings are converted to null in the dataframe. For emptyValue, only "\"\"" (quoted empty strings) is converted, and it becomes the emptyValue string, rather than both "\"\"" (quoted empty strings) and emptyValue strings being converted to "" (empty) in the dataframe.
> I think
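The read-side difference described in the issue can be sketched as a small plain-Scala model with no Spark dependency. This is only an illustration under stated assumptions, not Spark's or univocity's actual code: `tokenize`, `currentRead`, and `proposedRead` are hypothetical names, and `proposedRead` encodes the change requested in the issue title.

```scala
// Simplified model of the CSV read path discussed in SPARK-37604.
// All names here are hypothetical; this is not Spark's implementation.
object CsvReadModel {
  // Models the univocity settings quoted in the issue:
  // no characters read -> nullValue; quoted empty ("") -> emptyValue; else the text.
  def tokenize(raw: String, nullValue: String, emptyValue: String): String =
    if (raw.isEmpty) nullValue
    else if (raw == "\"\"") emptyValue
    else raw

  // Models the check in nullSafeDatum: null or nullValue becomes null,
  // everything else (including the emptyValue string) passes through.
  def currentRead(raw: String, nullValue: String, emptyValue: String): String = {
    val datum = tokenize(raw, nullValue, emptyValue)
    if (datum == null || datum == nullValue) null else datum
  }

  // Assumed proposal: any field matching emptyValue is read back as "".
  def proposedRead(raw: String, nullValue: String, emptyValue: String): String = {
    val datum = currentRead(raw, nullValue, emptyValue)
    if (datum != null && datum == emptyValue) "" else datum
  }

  def main(args: Array[String]): Unit = {
    // Raw fields: the nullValue string, nothing, the emptyValue string, quoted empty, text.
    for (raw <- Seq("NULL", "", "EMPTY", "\"\"", "Tesla")) {
      val cur = currentRead(raw, "NULL", "EMPTY")
      val prop = proposedRead(raw, "NULL", "EMPTY")
      println(s"raw=[$raw] current=[$cur] proposed=[$prop]")
    }
  }
}
```

In this model the two read functions differ only for fields that equal the emptyValue string, either literally or after a quoted empty field has been substituted.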
[jira] [Comment Edited] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460144#comment-17460144 ]

Wei Guo edited comment on SPARK-37604 at 12/15/21, 6:04 PM:

For this code:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back into a dataframe:
{code:scala}
spark.read.option("emptyValue", "EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!

For null values, we can write and read back with the same nullValue option, but for empty strings, even with the same emptyValue option, it's irreversible.

FYI. [~maxgekk]
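The round trip described in the comment can be checked with a short plain-Scala sketch. It is a simplified model rather than Spark code: `write`, `readCurrent`, `readProposed`, and `roundTrips` are hypothetical helpers, and `readProposed` assumes the behavior requested in the issue title.

```scala
// Round-trip model for nullValue and emptyValue; all names are hypothetical.
object CsvRoundTrip {
  val nullValue = "NULL"
  val emptyValue = "EMPTY"

  // Write side, as described in the issue: null -> nullValue, "" -> emptyValue.
  def write(v: String): String =
    if (v == null) nullValue else if (v.isEmpty) emptyValue else v

  // Current read side: only nullValue is mapped back.
  def readCurrent(field: String): String =
    if (field == nullValue) null else field

  // Assumed proposal: emptyValue is mapped back to "" as well.
  def readProposed(field: String): String =
    if (field == nullValue) null
    else if (field == emptyValue) ""
    else field

  // True when writing and then reading a value returns the original value.
  def roundTrips(read: String => String)(v: String): Boolean =
    read(write(v)) == v

  def main(args: Array[String]): Unit = {
    println(s"null, current read:  ${roundTrips(readCurrent)(null)}")
    println(s"empty, current read: ${roundTrips(readCurrent)("")}")
    println(s"empty, proposed read: ${roundTrips(readProposed)("")}")
  }
}
```

One consequence of the proposed mapping in this model: a field whose real content is the literal emptyValue string would also read back as "", the same trade-off that already applies to nullValue.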
[jira] [Comment Edited] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460144#comment-17460144 ]

Wei Guo edited comment on SPARK-37604 at 12/15/21, 6:03 PM:

For this code:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back into a dataframe:
{code:scala}
spark.read.option("emptyValue", "EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!

For null values, we can write and read back with the same nullValue option, but for empty strings, even with the same emptyValue option, it's irreversible.

FYI. [~maxgekk]

was (Author: wayne guo):

For this code:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back into a dataframe:
{code:scala}
spark.read.option("emptyValue", "EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!

For null values, we can write and read back with the same nullValue option, but for empty strings, even with the same emptyValue option, it's irreversible.
[jira] [Comment Edited] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460144#comment-17460144 ]

Wei Guo edited comment on SPARK-37604 at 12/15/21, 6:03 PM:

For this code:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back into a dataframe:
{code:scala}
spark.read.option("emptyValue", "EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!

For null values, we can write and read back with the same nullValue option, but for empty strings, even with the same emptyValue option, it's irreversible.

was (Author: wayne guo):

For this code:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back into a dataframe:
{code:scala}
spark.read.option("emptyValue", "EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!
[jira] [Commented] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460144#comment-17460144 ]

Wei Guo commented on SPARK-37604:

For this code:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back into a dataframe:
{code:scala}
spark.read.option("emptyValue", "EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!
[jira] [Updated] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Guo updated SPARK-37604:
    Attachment: empty_test.png
[jira] [Updated] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Guo updated SPARK-37604:
    Attachment: image-2021-12-16-01-57-55-864.png
[jira] [Updated] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Summary: Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files (was: The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading) > Change emptyValueInRead's effect to that any fields matching this string will > be set as "" when reading csv files > - > > Key: SPARK-37604 > URL: https://issues.apache.org/jira/browse/SPARK-37604 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.2.0 >Reporter: Wei Guo >Priority: Major
[jira] [Comment Edited] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460091#comment-17460091 ] Wei Guo edited comment on SPARK-37604 at 12/15/21, 4:48 PM: [~hyukjin.kwon], [~maxgekk] Shall we have a simple discussion about it in your free time, I'd like to hear your thoughts on this. was (Author: wayne guo): [~hyukjin.kwon][~maxgekk] Shall we have a simple discussion about it in your free time, I'd like to hear your thoughts on this.
[jira] [Commented] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460091#comment-17460091 ] Wei Guo commented on SPARK-37604: [~hyukjin.kwon][~maxgekk] Shall we have a simple discussion about it in your free time, I'd like to hear your thoughts on this.
[jira] [Commented] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460084#comment-17460084 ] Wei Guo commented on SPARK-37604: Maybe this issue is not a notable bug or promotion, but, for users' common usage, they prefer to be able to convert these emptyValue strings in csv files into ""(empty strings) again after writing out empty strings as emptyValue strings rather than current behaviors.
[jira] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604 ] Wei Guo deleted comment on SPARK-37604: was (Author: wayne guo): The current behavior of emptyValueInRead is more like the function for null values in Dataset: {code:scala} dataframe.na.fill(fillMap){code} So, we can also provide a function in Dataset similar to it, such as: {code:scala} dataframe.empty.fill(fillMap){code} rather than to change empty strings to emptyValueInRead in DataFrame when reading csv files.
[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Description:

The csv data format was imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766].

{*}For the nullValue option{*}, according to the features described in the spark-csv readme file, it is designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame
When writing files:
nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null: String)).toDF("make", "comment").write.option("nullValue", "NULL").csv(path)
{code}
The saved csv file is:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:scala}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is:
||make||comment||
|Tesla|null|

So null columns in a dataframe are saved as "NULL" strings in csv files, and {color:#00875a}*"NULL" strings in csv files are parsed back as null columns*{color} in the dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
However, in the underlying univocity component, the nullValue option in {*}_CommonSettings_{*} is designed as:
{noformat}
when reading, if the parser does not read any character from the input, the nullValue is used instead of an empty string.
when writing, if the writer has a null object to write to the output, the nullValue is used instead of an empty string.
{noformat}
{*}There is a difference when reading{*}. In univocity, empty content is converted to the nullValue string. In Spark, both empty content and nullValue strings are finally converted to null in the *_UnivocityParser_ _nullSafeDatum_* method:
{code:scala}
private def nullSafeDatum(
    datum: String,
    name: String,
    nullable: Boolean,
    options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
    if (!nullable) {
      throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
    }
    null
  } else {
    converter.apply(datum)
  }
}
{code}
Now for emptyValue. {*}For the emptyValue option{*}, we add an emptyValueInRead option for reading and an emptyValueInWrite option for writing. I found that Spark keeps the same behavior for emptyValue as univocity, that is:
{noformat}
When reading, if the parser does not read any character from the input, and the input is within quotes, the empty is used instead of an empty string.
When writing, if the writer has an empty String to write to the output, the emptyValue is used instead of an empty string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path)
{code}
The saved csv file is:
{noformat}
Tesla,EMPTY
{noformat}
When reading:
{code:scala}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed dataframe is:
||make||comment||
|Tesla|EMPTY|

So empty columns in a dataframe are saved as "EMPTY" strings in csv files, *{color:#de350b}but "EMPTY" strings in csv files cannot be parsed back as empty columns{color}* in the dataframe. That is:
{noformat}
When writing, convert "" empty(in dataframe) to emptyValue(in csv)
When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
{noformat}
There is an obvious difference between nullValue and emptyValue in read handling. For nullValue, we convert both nothing and nullValue strings to null in the dataframe. For emptyValue, we only convert "\"\"" (quoted empty strings) to emptyValue strings, rather than converting both "\"\"" (quoted empty strings) and emptyValue strings to "" (empty) in the dataframe.

I think it is better to keep emptyValue consistent with nullValue when reading, that is, to recover emptyValue strings in csv files to "". So I suggest that emptyValueInRead (in CSVOptions) be designed so that any field matching this string is set to the empty value "" when reading.
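The read-side asymmetry in the SPARK-37604 description can be modelled without Spark. The following is only a minimal sketch under stated assumptions: the object `EmptyValueReadSketch`, the `RawField` types, and the functions `currentRead` and `proposedRead` are hypothetical names made up for illustration, not Spark or univocity APIs, and `None` stands in for a null column value.

```scala
// Hypothetical, Spark-free model of how one CSV field is turned into a column value.
// None stands for null in the DataFrame. All names here are illustrative assumptions.
object EmptyValueReadSketch {
  sealed trait RawField
  case object NothingRead extends RawField              // unquoted field with no characters
  case object QuotedEmpty extends RawField              // the field was written as ""
  final case class Text(value: String) extends RawField // any other field content

  // Behavior described in the issue: only quoted empty strings become emptyValue,
  // while a field that already contains the emptyValue string is left as it is.
  def currentRead(raw: RawField, nullValue: String, emptyValue: String): Option[String] =
    raw match {
      case NothingRead               => None
      case QuotedEmpty               => Some(emptyValue)
      case Text(s) if s == nullValue => None
      case Text(s)                   => Some(s)
    }

  // Suggested behavior: quoted empty strings and emptyValue strings both become "",
  // mirroring how nothing and nullValue strings both become null.
  def proposedRead(raw: RawField, nullValue: String, emptyValue: String): Option[String] =
    raw match {
      case NothingRead                => None
      case QuotedEmpty                => Some("")
      case Text(s) if s == nullValue  => None
      case Text(s) if s == emptyValue => Some("")
      case Text(s)                    => Some(s)
    }

  def main(args: Array[String]): Unit = {
    val field = Text("EMPTY") // what the writer produced for "" with emptyValue = "EMPTY"
    println(currentRead(field, "NULL", "EMPTY"))
    println(proposedRead(field, "NULL", "EMPTY"))
  }
}
```

With `emptyValue` set to `"EMPTY"`, the two functions differ only for `QuotedEmpty` and `Text("EMPTY")` inputs, which is exactly the round-trip case raised in the issue.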
[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Description: The CSV data format was imported from Databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766]. {*}For the nullValue option{*}, according to the feature description in the spark-csv README file, it is designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string. {noformat} For example, when writing: {code:scala} Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code} The saved CSV file is: {noformat} Tesla,NULL {noformat} When reading: {code:scala} spark.read.option("nullValue", "NULL").csv(path).show() {code} The parsed DataFrame is: ||make||comment|| |Tesla|null| We can see that null columns in a DataFrame are saved as "NULL" strings in CSV files and {color:#00875a}*"NULL" strings in CSV files are parsed back as null columns*{color} in the DataFrame. That is: {noformat} When writing, convert null(in dataframe) to nullValue(in csv) When reading, convert nullValue or nothing(in csv) to null(in dataframe) {noformat} However, the nullValue option in the underlying univocity dependency's {*}_CommonSettings_{*} is designed as follows: {noformat} when reading, if the parser does not read any character from the input, the nullValue is used instead of an empty string. when writing, if the writer has a null object to write to the output, the nullValue is used instead of an empty string.{noformat} {*}There is a difference when reading{*}. In univocity, a field with no content is converted to the nullValue string.
But in Spark, we ultimately convert both fields with no content and nullValue strings to null in the *_UnivocityParser_ _nullSafeDatum_* method: {code:scala} private def nullSafeDatum( datum: String, name: String, nullable: Boolean, options: CSVOptions)(converter: ValueConverter): Any = { if (datum == options.nullValue || datum == null) { if (!nullable) { throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name) } null } else { converter.apply(datum) } } {code} Now let us turn to emptyValue. {*}For the emptyValue option{*}, we added an emptyValueInRead option for reading and an emptyValueInWrite option for writing. I found that Spark keeps the same behavior for emptyValue as univocity: {noformat} When reading, if the parser does not read any character from the input, and the input is within quotes, the empty is used instead of an empty string. When writing, if the writer has an empty String to write to the output, the emptyValue is used instead of an empty string.{noformat} For example, when writing: {code:scala} Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code} The saved CSV file is: {noformat} Tesla,EMPTY {noformat} When reading: {code:scala} spark.read.option("emptyValue", "EMPTY").csv(path).show() {code} The parsed DataFrame is: ||make||comment|| |Tesla|EMPTY| We can see that empty columns in a DataFrame are saved as "EMPTY" strings in CSV files, *{color:#de350b}but "EMPTY" strings in CSV files cannot be parsed back as empty columns{color}* in the DataFrame. That is: {noformat} When writing, convert "" empty(in dataframe) to emptyValue(in csv) When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe) {noformat} There is an obvious difference between nullValue and emptyValue in read handling.
For nullValue, we convert both fields with no content and nullValue strings to null in the DataFrame, but for emptyValue, we only convert "\"\"" (quoted empty strings) to emptyValue, rather than converting both "\"\"" (quoted empty strings) and emptyValue strings to "" (empty) in the DataFrame. I think it is better for emptyValue to behave like nullValue when reading (that is, to recover emptyValue as ""), so I suggest that emptyValueInRead (in CSVOptions) be designed so that any field matching this string is set to the empty value "" when reading. was: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766] . {*}For the nullValue option{*}, according to features described in spark-csv readme file, it's designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files:
[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Description: The CSV data format was imported from Databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766]. {*}For the nullValue option{*}, according to the feature description in the spark-csv README file, it is designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string. {noformat} For example, when writing: {code:scala} Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code} The saved CSV file is: {noformat} Tesla,NULL {noformat} When reading: {code:scala} spark.read.option("nullValue", "NULL").csv(path).show() {code} The parsed DataFrame is: ||make||comment|| |Tesla|null| We can see that null columns in a DataFrame are saved as "NULL" strings in CSV files and {color:#00875a}*"NULL" strings in CSV files are parsed back as null columns*{color} in the DataFrame. That is: {noformat} When writing, convert null(in dataframe) to nullValue(in csv) When reading, convert nullValue or nothing(in csv) to null(in dataframe) {noformat} However, the nullValue option in the underlying univocity dependency's {*}_CommonSettings_{*} is designed as follows: {noformat} when reading, if the parser does not read any character from the input, the nullValue is used instead of an empty string. when writing, if the writer has a null object to write to the output, the nullValue is used instead of an empty string.{noformat} {*}There is a difference when reading{*}. In univocity, a field with no content is converted to the nullValue string.
But in Spark, we ultimately convert both fields with no content and nullValue strings to null in the *_UnivocityParser_ _nullSafeDatum_* method: {code:scala} private def nullSafeDatum( datum: String, name: String, nullable: Boolean, options: CSVOptions)(converter: ValueConverter): Any = { if (datum == options.nullValue || datum == null) { if (!nullable) { throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name) } null } else { converter.apply(datum) } } {code} {*}For the emptyValue option{*}, we added an emptyValueInRead option for reading and an emptyValueInWrite option for writing. I found that Spark keeps the same behavior for emptyValue as univocity: {noformat} When reading, if the parser does not read any character from the input, and the input is within quotes, the empty is used instead of an empty string. When writing, if the writer has an empty String to write to the output, the emptyValue is used instead of an empty string.{noformat} For example, when writing: {code:scala} Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code} The saved CSV file is: {noformat} Tesla,EMPTY {noformat} When reading: {code:scala} spark.read.option("emptyValue", "EMPTY").csv(path).show() {code} The parsed DataFrame is: ||make||comment|| |Tesla|EMPTY| We can see that empty columns in a DataFrame are saved as "EMPTY" strings in CSV files, *{color:#de350b}but "EMPTY" strings in CSV files cannot be parsed back as empty columns{color}* in the DataFrame. That is: {noformat} When writing, convert "" empty(in dataframe) to emptyValue(in csv) When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe) {noformat} There is an obvious difference between nullValue and emptyValue in read handling.
For nullValue, we convert both fields with no content and nullValue strings to null in the DataFrame, but for emptyValue, we only convert "\"\"" (quoted empty strings) to emptyValue, rather than converting both "\"\"" (quoted empty strings) and emptyValue strings to "" (empty) in the DataFrame. I think it is better for emptyValue to behave like nullValue when reading (that is, to recover emptyValue as ""), so I suggest that emptyValueInRead (in CSVOptions) be designed so that any field matching this string is set to the empty value "" when reading. was: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766] . {*}For the nullValue option{*}, according to features described in spark-csv readme file, it's designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that indicates a null
[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Description: The CSV data format was imported from Databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766]. {*}For the nullValue option{*}, according to the feature description in the spark-csv README file, it is designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string. {noformat} For example, when writing: {code:scala} Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code} The saved CSV file is: {noformat} Tesla,NULL {noformat} When reading: {code:scala} spark.read.option("nullValue", "NULL").csv(path).show() {code} The parsed DataFrame is: ||make||comment|| |Tesla|null| We can see that null columns in a DataFrame are saved as "NULL" strings in CSV files and {color:#00875a}*"NULL" strings in CSV files are parsed back as null columns*{color} in the DataFrame. That is: {noformat} When writing, convert null(in dataframe) to nullValue(in csv) When reading, convert nullValue or nothing(in csv) to null(in dataframe) {noformat} However, the nullValue option in the underlying univocity dependency's {*}_CommonSettings_{*} is designed as follows: {noformat} when reading, if the parser does not read any character from the input, the nullValue is used instead of an empty string. when writing, if the writer has a null object to write to the output, the nullValue is used instead of an empty string.{noformat} {*}There is a difference when reading{*}. In univocity, a field with no content would be converted to the nullValue string.
But in Spark, we ultimately convert both fields with no content and nullValue strings to null in the *_UnivocityParser_ _nullSafeDatum_* method: {code:scala} private def nullSafeDatum( datum: String, name: String, nullable: Boolean, options: CSVOptions)(converter: ValueConverter): Any = { if (datum == options.nullValue || datum == null) { if (!nullable) { throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name) } null } else { converter.apply(datum) } } {code} {*}For the emptyValue option{*}, we added an emptyValueInRead option for reading and an emptyValueInWrite option for writing. I found that Spark keeps the same behavior for emptyValue as univocity: {noformat} When reading, if the parser does not read any character from the input, and the input is within quotes, the empty is used instead of an empty string. When writing, if the writer has an empty String to write to the output, the emptyValue is used instead of an empty string.{noformat} For example, when writing: {code:scala} Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code} The saved CSV file is: {noformat} Tesla,EMPTY {noformat} When reading: {code:scala} spark.read.option("emptyValue", "EMPTY").csv(path).show() {code} The parsed DataFrame is: ||make||comment|| |Tesla|EMPTY| We can see that empty columns in a DataFrame are saved as "EMPTY" strings in CSV files, *{color:#de350b}but "EMPTY" strings in CSV files cannot be parsed back as empty columns{color}* in the DataFrame. That is: {noformat} When writing, convert "" empty(in dataframe) to emptyValue(in csv) When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe) {noformat} There is an obvious difference between nullValue and emptyValue in read handling.
For nullValue, we convert both fields with no content and nullValue strings to null in the DataFrame, but for emptyValue, we only convert "\"\"" (quoted empty strings) to emptyValue, rather than converting both "\"\"" (quoted empty strings) and emptyValue strings to "" (empty) in the DataFrame. I think it is better for emptyValue to behave like nullValue when reading (that is, to recover emptyValue as ""), so I suggest that emptyValueInRead (in CSVOptions) be designed so that any field matching this string is set to the empty value "" when reading. was: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766] . {*}For the nullValue option{*}, according to features described in spark-csv readme file, it's designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string
[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Description: The CSV data format was imported from Databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766]. {*}For the nullValue option{*}, according to the feature description in the spark-csv README file, it is designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string. {noformat} For example, when writing: {code:scala} Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code} The saved CSV file is: {noformat} Tesla,NULL {noformat} When reading: {code:scala} spark.read.option("nullValue", "NULL").csv(path).show() {code} The parsed DataFrame is: ||make||comment|| |Tesla|null| We can see that null columns in a DataFrame are saved as "NULL" strings in CSV files and {color:#00875a}*"NULL" strings in CSV files are parsed back as null columns*{color} in the DataFrame. That is: {noformat} When writing, convert null(in dataframe) to nullValue(in csv) When reading, convert nullValue or nothing(in csv) to null(in dataframe) {noformat} However, the nullValue option in the underlying univocity dependency's {*}_CommonSettings_{*} is designed as follows: {noformat} when reading, if the parser does not read any character from the input, the nullValue is used instead of an empty string. when writing, if the writer has a null object to write to the output, the nullValue is used instead of an empty string.{noformat} There is a difference when reading. In univocity, a field with no content would be converted to the nullValue string.
But in Spark, we ultimately convert both fields with no content and nullValue strings to null in the *_UnivocityParser_ _nullSafeDatum_* method: {code:scala} private def nullSafeDatum( datum: String, name: String, nullable: Boolean, options: CSVOptions)(converter: ValueConverter): Any = { if (datum == options.nullValue || datum == null) { if (!nullable) { throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name) } null } else { converter.apply(datum) } } {code} {*}For the emptyValue option{*}, we added an emptyValueInRead option for reading and an emptyValueInWrite option for writing. I found that Spark keeps the same behavior for emptyValue as univocity: {noformat} When reading, if the parser does not read any character from the input, and the input is within quotes, the empty is used instead of an empty string. When writing, if the writer has an empty String to write to the output, the emptyValue is used instead of an empty string.{noformat} For example, when writing: {code:scala} Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code} The saved CSV file is: {noformat} Tesla,EMPTY {noformat} When reading: {code:scala} spark.read.option("emptyValue", "EMPTY").csv(path).show() {code} The parsed DataFrame is: ||make||comment|| |Tesla|EMPTY| We can see that empty columns in a DataFrame are saved as "EMPTY" strings in CSV files, *{color:#de350b}but "EMPTY" strings in CSV files cannot be parsed back as empty columns{color}* in the DataFrame. That is: {noformat} When writing, convert "" empty(in dataframe) to emptyValue(in csv) When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe) {noformat} There is an obvious difference between nullValue and emptyValue in read handling.
For nullValue, we convert both fields with no content and nullValue strings to null in the DataFrame, but for emptyValue, we only convert "\"\"" (quoted empty strings) to emptyValue, rather than converting both "\"\"" (quoted empty strings) and emptyValue strings to "" (empty) in the DataFrame. I think it is better for emptyValue to behave like nullValue when reading (that is, to recover emptyValue as ""), so I suggest that emptyValueInRead (in CSVOptions) be designed so that any field matching this string is set to the empty value "" when reading. was: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766] . {*}For the nullValue option{*}, according to features described in spark-csv readme file, it's designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that
[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Description: The CSV data format was imported from Databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766]. {*}For the nullValue option{*}, according to the feature description in the spark-csv README file, it is designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string. {noformat} For example, when writing: {code:scala} Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code} The saved CSV file is: {noformat} Tesla,NULL {noformat} When reading: {code:scala} spark.read.option("nullValue", "NULL").csv(path).show() {code} The parsed DataFrame is: ||make||comment|| |Tesla|null| We can see that null columns in a DataFrame are saved as "NULL" strings in CSV files and *{color:#de350b}"NULL" strings in CSV files are parsed back as null columns{color}* in the DataFrame. That is: {noformat} When writing, convert null(in dataframe) to nullValue(in csv) When reading, convert nullValue or nothing(in csv) to null(in dataframe) {noformat} However, the nullValue option in the underlying univocity dependency's {*}_CommonSettings_{*} is designed as follows: {noformat} when reading, if the parser does not read any character from the input, the nullValue is used instead of an empty string. when writing, if the writer has a null object to write to the output, the nullValue is used instead of an empty string.{noformat} There is a difference when reading. In univocity, a field with no content would be converted to the nullValue string.
But in Spark, we ultimately convert both fields with no content and nullValue strings to null in the *_UnivocityParser_ _nullSafeDatum_* method: {code:scala} private def nullSafeDatum( datum: String, name: String, nullable: Boolean, options: CSVOptions)(converter: ValueConverter): Any = { if (datum == options.nullValue || datum == null) { if (!nullable) { throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name) } null } else { converter.apply(datum) } } {code} {*}For the emptyValue option{*}, we added an emptyValueInRead option for reading and an emptyValueInWrite option for writing. I found that Spark keeps the same behavior for emptyValue as univocity: {noformat} When reading, if the parser does not read any character from the input, and the input is within quotes, the empty is used instead of an empty string. When writing, if the writer has an empty String to write to the output, the emptyValue is used instead of an empty string.{noformat} For example, when writing: {code:scala} Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code} The saved CSV file is: {noformat} Tesla,EMPTY {noformat} When reading: {code:scala} spark.read.option("emptyValue", "EMPTY").csv(path).show() {code} The parsed DataFrame is: ||make||comment|| |Tesla|EMPTY| We can see that empty columns in a DataFrame are saved as "EMPTY" strings in CSV files, *{color:#de350b}but "EMPTY" strings in CSV files cannot be parsed back as empty columns{color}* in the DataFrame. That is: {noformat} When writing, convert "" empty(in dataframe) to emptyValue(in csv) When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe) {noformat} There is an obvious difference between nullValue and emptyValue in read handling.
For nullValue, we convert both fields with no content and nullValue strings to null in the DataFrame, but for emptyValue, we only convert "\"\"" (quoted empty strings) to emptyValue, rather than converting both "\"\"" (quoted empty strings) and emptyValue strings to "" (empty) in the DataFrame. I think it is better for emptyValue to behave like nullValue when reading (that is, to recover emptyValue as ""), so I suggest that emptyValueInRead (in CSVOptions) be designed so that any field matching this string is set to the empty value "" when reading. was: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766] . {*}For the nullValue option{*}, according to features described in spark-csv readme file, it's designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that
[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Description: The CSV data format was imported from Databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766]. {*}For the nullValue option{*}, according to the feature description in the spark-csv README file, it is designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string. {noformat} For example, when writing: {code:scala} Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code} The saved CSV file is: {noformat} Tesla,NULL {noformat} When reading: {code:scala} spark.read.option("nullValue", "NULL").csv(path).show() {code} The parsed DataFrame is: ||make||comment|| |Tesla|null| We can see that null columns in a DataFrame are saved as "NULL" strings in CSV files and *{color:#de350b}"NULL" strings in CSV files are parsed back as null columns{color}* in the DataFrame. That is: {noformat} When writing, convert null(in dataframe) to nullValue(in csv) When reading, convert nullValue or nothing(in csv) to null(in dataframe) {noformat} However, the nullValue option in the underlying univocity dependency's {*}_CommonSettings_{*} is designed as follows: {noformat} when reading, if the parser does not read any character from the input, the nullValue is used instead of an empty string. when writing, if the writer has a null object to write to the output, the nullValue is used instead of an empty string.{noformat} There is a difference when reading. In univocity, a field with no content would be converted to the nullValue string.
But in Spark, we ultimately convert both fields with no content and nullValue strings to null in the *_UnivocityParser_ _nullSafeDatum_* method: {code:scala} private def nullSafeDatum( datum: String, name: String, nullable: Boolean, options: CSVOptions)(converter: ValueConverter): Any = { if (datum == options.nullValue || datum == null) { if (!nullable) { throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name) } null } else { converter.apply(datum) } } {code} {*}For the emptyValue option{*}, we added an emptyValueInRead option for reading and an emptyValueInWrite option for writing. I found that Spark keeps the same behavior for emptyValue as univocity: {noformat} When reading, if the parser does not read any character from the input, and the input is within quotes, the empty is used instead of an empty string. When writing, if the writer has an empty String to write to the output, the emptyValue is used instead of an empty string.{noformat} For example, when writing: {code:scala} Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code} The saved CSV file is: {noformat} Tesla,EMPTY {noformat} When reading: {code:scala} spark.read.option("emptyValue", "EMPTY").csv(path).show() {code} The parsed DataFrame is: ||make||comment|| |Tesla|EMPTY| We can see that empty columns in a DataFrame are saved as "EMPTY" strings in CSV files, *{color:#de350b}but "EMPTY" strings in CSV files cannot be parsed back as empty columns{color}* in the DataFrame. That is: {noformat} When writing, convert "" empty(in dataframe) to emptyValue(in csv) When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe) {noformat} There is an obvious difference between nullValue and emptyValue in read handling.
For nullValue, we convert both fields with no content and nullValue strings to null in the DataFrame, but for emptyValue, we only convert "\"\"" (quoted empty strings) to emptyValue, rather than converting both "\"\"" (quoted empty strings) and emptyValue strings to "" (empty) in the DataFrame. I think it is better for emptyValue to behave like nullValue when reading (that is, to recover emptyValue as ""), so I suggest that emptyValueInRead (in CSVOptions) be designed so that any field matching this string is set to the empty value "" when reading. was: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766] . {*}For the nullValue option{*}, according to features description in spark-csv readme file, it's designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that
[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Description: The CSV data format was imported from Databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766]. {*}For the nullValue option{*}, according to the feature description in the spark-csv README file, it is designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string. {noformat} For example, when writing: {code:scala} Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code} The saved CSV file is: {noformat} Tesla,NULL {noformat} When reading: {code:scala} spark.read.option("nullValue", "NULL").csv(path).show() {code} The parsed DataFrame is: ||make||comment|| |Tesla|null| We can see that null columns in a DataFrame are saved as "NULL" strings in CSV files and *{color:#de350b}"NULL" strings in CSV files are parsed back as null columns{color}* in the DataFrame. That is: {noformat} When writing, convert null(in dataframe) to nullValue(in csv) When reading, convert nullValue or nothing(in csv) to null(in dataframe) {noformat} However, the nullValue option in the underlying univocity dependency's {*}_CommonSettings_{*} is designed as follows: {noformat} when reading, if the parser does not read any character from the input, the nullValue is used instead of an empty string. when writing, if the writer has a null object to write to the output, the nullValue is used instead of an empty string.{noformat} There is a difference when reading. In univocity, a field with no content would be converted to the nullValue string.
In Spark, however, both empty content and nullValue strings are ultimately converted to null in the *_UnivocityParser_ _nullSafeDatum_* method:
{code:scala}
private def nullSafeDatum(
    datum: String,
    name: String,
    nullable: Boolean,
    options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
    if (!nullable) {
      throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
    }
    null
  } else {
    converter.apply(datum)
  }
}
{code}
{*}For the emptyValue option{*}, Spark adds an emptyValueInRead option for reading and an emptyValueInWrite option for writing. I found that Spark keeps the same behavior for emptyValue as univocity:
{noformat}
When reading, if the parser does not read any character from the input, and the input is within quotes, the empty is used instead of an empty string.
When writing, if the writer has an empty String to write to the output, the emptyValue is used instead of an empty string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path)
{code}
The saved CSV file contains:
{noformat}
Tesla,EMPTY
{noformat}
When reading:
{code:scala}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed DataFrame is shown as:
||make||comment||
|Tesla|EMPTY|

So empty strings in a DataFrame are saved as "EMPTY" strings in CSV files, *{color:#de350b}but "EMPTY" strings in CSV files are not parsed back as empty strings{color}* in the DataFrame. That is:
{noformat}
When writing, convert "" empty(in dataframe) to emptyValue(in csv)
When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
{noformat}
There is an obvious difference between nullValue and emptyValue in read handling.
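The read-side asymmetry can be illustrated with a small model that does not depend on Spark or univocity. This is only a sketch under stated assumptions: `CsvReadModel`, `RawField`, `tokenize`, `currentRead`, and `symmetricRead` are hypothetical names invented for illustration, `Option[String]` stands in for a nullable string column, and `symmetricRead` shows the behavior suggested in the issue title rather than anything Spark currently does.

```scala
// Hypothetical, simplified model of CSV read handling; not Spark's actual code.
object CsvReadModel {
  // What the tokenizer finds in a field position of the CSV file.
  sealed trait RawField
  case object NoContent extends RawField                 // e.g. `Tesla,`
  case object QuotedEmpty extends RawField               // e.g. `Tesla,""`
  final case class Text(value: String) extends RawField  // e.g. `Tesla,EMPTY`

  // univocity-like step, following the quoted settings descriptions:
  // no content becomes nullValue, a quoted empty string becomes emptyValue.
  def tokenize(raw: RawField, nullValue: String, emptyValue: String): String =
    raw match {
      case NoContent   => nullValue
      case QuotedEmpty => emptyValue
      case Text(v)     => v
    }

  // Step mirroring nullSafeDatum for a nullable string column:
  // a datum equal to nullValue becomes null (None); anything else is kept.
  def currentRead(raw: RawField, nullValue: String, emptyValue: String): Option[String] = {
    val datum = tokenize(raw, nullValue, emptyValue)
    if (datum == null || datum == nullValue) None else Some(datum)
  }

  // Assumed symmetric variant: a datum equal to emptyValue is mapped back to "".
  def symmetricRead(raw: RawField, nullValue: String, emptyValue: String): Option[String] =
    currentRead(raw, nullValue, emptyValue).map(d => if (d == emptyValue) "" else d)
}
```

In this model, `currentRead(Text("NULL"), "NULL", "EMPTY")` yields `None`, so the nullValue string round-trips to null, while `currentRead(Text("EMPTY"), "NULL", "EMPTY")` yields `Some("EMPTY")`, so the emptyValue string does not round-trip to `""`. The `symmetricRead` variant returns `Some("")` for the same input.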
For nullValue, Spark converts both empty content and nullValue strings to null in the DataFrame. For emptyValue, it only converts "\"\"" (quoted empty strings) to emptyValue, rather than converting both "\"\"" (quoted empty strings) and emptyValue strings to "" (empty) in the DataFrame.

I think it is better to keep emptyValue consistent with nullValue when reading (that is, to recover emptyValue as ""), so I suggest that emptyValueInRead (in CSVOptions) be designed so that any field matching this string is set to the empty value "" when reading.

was:
The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with PR [10766|https://github.com/apache/spark/pull/10766] . {*}For the nullValue option{*}, according to the features description in spark-csv readme file, it is designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a
[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Description: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR [10766|https://github.com/apache/spark/pull/10766] . {*}For the nullValue option{*}, according to the features description in spark-csv readme file, it is designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string. {noformat} For example, when writing: {code:scala} Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code} The saved csv file is shown as: {noformat} Tesla,NULL {noformat} When reading: {code:java} spark.read.option("nullValue", "NULL").csv(path).show() {code} The parsed dataframe is shown as: ||make||comment|| |tesla|null| We can find that null columns in dataframe can be saved as NULL strings in csv files and NULL strings in csv files can be parsed as columns of null values in dataframe. That is: {noformat} When writing, convert null(in dataframe) to nullValue(in csv) When reading, convert nullValue or nothing(in csv) to null(in dataframe) {noformat} But actually, the option nullValue in depended component univocity's {*}_CommonSettings_{*}, is designed as that: {noformat} when reading, if the parser does not read any character from the input, the nullValue is used instead of an empty string. when writing, if the writer has a null object to write to the output, the nullValue is used instead of an empty string.{noformat} There is a difference when reading. In univocity, nothing content would be convert to nullValue strings. 
But In Spark, we finally convert nothing content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* method: {code:java} private def nullSafeDatum( datum: String, name: String, nullable: Boolean, options: CSVOptions)(converter: ValueConverter): Any = { if (datum == options.nullValue || datum == null) { if (!nullable) { throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name) } null } else { converter.apply(datum) } } {code} {*}For the emptyValue option{*}, we add a emptyValueInRead option for reading and a emptyValueInWrite option for writing. I found that both Spark keeps the same behaviors for emptyValue with univocity. {noformat} When reading, if the parser does not read any character from the input, and the input is within quotes, the empty is used instead of an empty string. When writing, if the writer has an empty String to write to the output, the emptyValue is used instead of an empty string.{noformat} For example, when writing: {code:scala} Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code} The saved csv file is shown as: {noformat} Tesla,NULL {noformat} When reading: {code:java} spark.read.option("nullValue", "NULL").csv(path).show() {code} The parsed dataframe is shown as: ||make||comment|| |tesla|null| We can find that null columns in dataframe can be saved as NULL strings in csv files and NULL strings in csv files can be parsed as columns of null values in dataframe. 
That is: {noformat} When writing, convert null(in dataframe) to nullValue(in csv) When reading, convert nullValue or nothing(in csv) to null(in dataframe) {noformat} Since Spark 2.4, for empty strings, there are emptyValueInRead for reading and emptyValueInWrite for writing that can be set in CSVOptions: {code:scala} // For writing, convert: ""(dataframe) => emptyValueInWrite(csv) // For reading, convert: "" (csv) => emptyValueInRead(dataframe){code} I think the read handling is not suitable, we can not convert "" or `{color:#172b4d}emptyValueInWrite`{color} values as ""(real empty strings) but get {color:#172b4d}emptyValueInRead's setting value actually{color}, it supposed to be as flows: {code:scala} // For reading, convert: "" or emptyValueInRead (csv) => ""(dataframe){code} {color:#de350b}*We can not recovery it to the original DataFrame.*{color} was: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR [10766|https://github.com/apache/spark/pull/10766] . For the nullValue option, according to the features description in spark-csv readme file, it is designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that indicates a null value, nulls in the
[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Description: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR [10766|https://github.com/apache/spark/pull/10766] . For the nullValue option, according to the features description in spark-csv readme file, it is designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string. {noformat} For example, when writing: {code:scala} Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code} The saved csv file is shown as: {noformat} Tesla,NULL {noformat} When reading: {code:java} spark.read.option("nullValue", "NULL").csv(path).show() {code} The parsed dataframe is shown as: ||make||comment|| |tesla|null| We can find that null columns in dataframe can be saved as NULL strings in csv files and NULL strings in csv files can be parsed as columns of null values in dataframe. That is: {noformat} When writing, convert null(in dataframe) to nullValue(in csv) When reading, convert nullValue or nothing(in csv) to null(in dataframe) {noformat} But actually, the option nullValue in depended component univocity's {*}_CommonSettings_{*}, is designed as that: {noformat} when reading, if the parser does not read any character from the input, the nullValue is used instead of an empty string when writing, if the writer has a null object to write to the output, the nullValue is used instead of an empty string{noformat} There is a difference when reading. In univocity, nothing would be convert to nullValue strings. 
But in Spark, we finally convert missing content or nullValue strings to null in the *_UnivocityParser_ _nullSafeDatum_* method: {code:java} private def nullSafeDatum( datum: String, name: String, nullable: Boolean, options: CSVOptions)(converter: ValueConverter): Any = { if (datum == options.nullValue || datum == null) { if (!nullable) { throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name) } null } else { converter.apply(datum) } } {code} For the emptyValue option, we add an emptyValueInRead option for reading and an emptyValueInWrite option for writing. For example, suppose a column has empty strings and emptyValueInWrite is set to the string "EMPTY": {code:scala} Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code} The saved csv file is shown as: {noformat} Tesla,EMPTY {noformat} If we then read this csv file with emptyValue (emptyValueInRead) set to the string "EMPTY": {code:java} spark.read.option("emptyValue", "EMPTY").csv(path).show() {code} we actually get a DataFrame whose data is: ||make||comment|| |Tesla|EMPTY| but the expected DataFrame is: ||make||comment|| |Tesla| | I found that Spark keeps the same behavior as the univocity component it depends on.
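The round trip in this example can be modeled without a Spark session. The following is a minimal sketch under stated assumptions: the object and function names are hypothetical, fields are assumed to be non-null strings, and `readProposed` encodes the behaviour suggested in this issue's title (a field matching emptyValue is read back as ""), not anything Spark currently does.

```scala
// Simplified, hypothetical model of the emptyValue round trip; not Spark's code.
object EmptyValueRoundTrip {
  // Writing: "" in the DataFrame is written as emptyValue in the csv file.
  def write(value: String, emptyValue: String): String =
    if (value.isEmpty) emptyValue else value

  // Reading as proposed in this issue: a quoted empty field or a field
  // equal to emptyValue is turned back into a real empty string.
  def readProposed(token: String, emptyValue: String): String =
    if (token == "\"\"" || token == emptyValue) "" else token

  def main(args: Array[String]): Unit = {
    val token = write("", "EMPTY")                 // the field stored in the csv file
    println(token)                                 // prints EMPTY
    println(readProposed(token, "EMPTY").isEmpty)  // prints true
  }
}
```

One consequence of this rule is that a field whose real content is the string "EMPTY" would also be read as "", which is the same trade-off nullValue already makes for "NULL".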
Since Spark 2.4, for empty strings, there are emptyValueInRead for reading and emptyValueInWrite for writing that can be set in CSVOptions: {code:scala} // For writing, convert: ""(dataframe) => emptyValueInWrite(csv) // For reading, convert: "" (csv) => emptyValueInRead(dataframe){code} I think the read handling is not suitable, we can not convert "" or `{color:#172b4d}emptyValueInWrite`{color} values as ""(real empty strings) but get {color:#172b4d}emptyValueInRead's setting value actually{color}, it supposed to be as flows: {code:scala} // For reading, convert: "" or emptyValueInRead (csv) => ""(dataframe){code} {color:#de350b}*We can not recovery it to the original DataFrame.*{color} was: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR [10766|https://github.com/apache/spark/pull/10766] . For the nullValue option, according to the features description in spark-csv readme file, it is designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string. {noformat} For example, when writing: {code:scala} Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code} The saved csv file is shown as: {noformat} Tesla,NULL {noformat} When reading: {code:java} spark.read.option("nullValue", "NULL").csv(path).show() {code} The parsed dataframe is
[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Description: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR [10766|https://github.com/apache/spark/pull/10766] . For the nullValue option, according to the features description in spark-csv readme file, it is designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string. {noformat} For example, when writing: {code:scala} Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code} The saved csv file is shown as: {noformat} Tesla,NULL {noformat} When reading: {code:java} spark.read.option("nullValue", "NULL").csv(path).show() {code} The parsed dataframe is shown as: ||make||comment|| |tesla|null| We can find that null columns in dataframe can be saved as NULL strings in csv files and NULL strings in csv files can be parsed as columns of null values in dataframe. That is: {noformat} When writing, convert null(in dataframe) to nullValue(in csv) When reading, convert nullValue or nothing(in csv) to null(in dataframe) {noformat} But actually, the option nullValue in component univocity's {*}_CommonSettings_{*}, is designed as that: {noformat} when reading, if the parser does not read any character from the input, the nullValue is used instead of an empty string when writing, if the writer has a null object to write to the output, the nullValue is used instead of an empty string{noformat} There is a difference when reading. In univocity, nothing would be convert to nullValue strings. 
But In Spark, we finally convert nothing or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* method: {code:java} private def nullSafeDatum( datum: String, name: String, nullable: Boolean, options: CSVOptions)(converter: ValueConverter): Any = { if (datum == options.nullValue || datum == null) { if (!nullable) { throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name) } null } else { converter.apply(datum) } } {code} For the emptyValue option, we add a emptyValueInRead option for reading and a emptyValueInWrite option for writing. I found that Since Spark 2.4, for empty strings, there are emptyValueInRead for reading and emptyValueInWrite for writing that can be set in CSVOptions: {code:scala} // For writing, convert: ""(dataframe) => emptyValueInWrite(csv) // For reading, convert: "" (csv) => emptyValueInRead(dataframe){code} I think the read handling is not suitable, we can not convert "" or `{color:#172b4d}emptyValueInWrite`{color} values as ""(real empty strings) but get {color:#172b4d}emptyValueInRead's setting value actually{color}, it supposed to be as flows: {code:scala} // For reading, convert: "" or emptyValueInRead (csv) => ""(dataframe){code} For example, a column has empty strings, if emptyValueInWrite is set to "EMPTY" string. {code:scala} Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code} The saved csv file is shown as: {noformat} Tesla,EMPTY {noformat} and if we read this csv file with emptyValue(emptyValueInRead) set to "EMPTY" string. 
{code:java} spark.read.option("emptyValue", "EMPTY").csv(path).show() {code} we actually get the DataFrame which data is shown as: ||make||comment|| |tesla|EMPTY| but the DataFrame which data should be shown as below as expected: ||make||comment|| |tesla| | {color:#de350b}*We can not recovery it to the original DataFrame.*{color} was: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR [10766|https://github.com/apache/spark/pull/10766] . For the nullValue option, according to the features description in spark-csv readme file, it is designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string. {noformat} For example, when writing: {code:scala} Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code} The saved csv file is shown as: {noformat} Tesla,NULL {noformat} When reading: {code:java} spark.read.option("nullValue", "NULL").csv(path).show() {code} The parsed dataframe is shown as: ||make||comment|| |tesla|null| We can find that null columns in dataframe can be saved as NULL strings in csv files and NULL
[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Description: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR [10766|https://github.com/apache/spark/pull/10766] . For the nullValue option, according to the features description in spark-csv readme file, it is designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string. {noformat} For example, when writing: {code:scala} Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code} The saved csv file is shown as: {noformat} Tesla,NULL {noformat} When reading: {code:java} spark.read.option("nullValue", "NULL").csv(path).show() {code} The parsed dataframe is shown as: ||make||comment|| |tesla|null| We can find that null columns in dataframe can be saved as NULL strings in csv files and NULL strings in csv files can be parsed as columns of null values in dataframe. That is: {noformat} When writing, convert null(in dataframe) to nullValue(in csv) When reading, convert nullValue or nothing(in csv) to null(in dataframe) {noformat} But actually, the option nullValue in component univocity's {*}_CommonSettings_{*}, is designed as that: {noformat} when reading, if the parser does not read any character from the input, the nullValue is used instead of an empty string when writing, if the writer has a null object to write to the output, the nullValue is used instead of an empty string{noformat} There is a difference when reading. In univocity, nothing would be convert to nullValue strings. 
But In Spark, we finally convert nothing or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* method: {code:java} private def nullSafeDatum( datum: String, name: String, nullable: Boolean, options: CSVOptions)(converter: ValueConverter): Any = { if (datum == options.nullValue || datum == null) { if (!nullable) { throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name) } null } else { converter.apply(datum) } } {code} Since Spark 2.4, for empty strings, there are emptyValueInRead for reading and emptyValueInWrite for writing that can be set in CSVOptions: {code:scala} // For writing, convert: ""(dataframe) => emptyValueInWrite(csv) // For reading, convert: "" (csv) => emptyValueInRead(dataframe){code} I think the read handling is not suitable, we can not convert "" or `{color:#172b4d}emptyValueInWrite`{color} values as ""(real empty strings) but get {color:#172b4d}emptyValueInRead's setting value actually{color}, it supposed to be as flows: {code:scala} // For reading, convert: "" or emptyValueInRead (csv) => ""(dataframe){code} For example, a column has empty strings, if emptyValueInWrite is set to "EMPTY" string. {code:scala} Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code} The saved csv file is shown as: {noformat} Tesla,EMPTY {noformat} and if we read this csv file with emptyValue(emptyValueInRead) set to "EMPTY" string. {code:java} spark.read.option("emptyValue", "EMPTY").csv(path).show() {code} we actually get the DataFrame which data is shown as: ||make||comment|| |tesla|EMPTY| but the DataFrame which data should be shown as below as expected: ||make||comment|| |tesla| | {color:#de350b}*We can not recovery it to the original DataFrame.*{color} was: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR [10766|https://github.com/apache/spark/pull/10766] . 
According to the features description in spark-csv readme file, the nullValue option is designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string. {noformat} For example, when writing: {code:scala} Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code} The saved csv file is shown as: {noformat} Tesla,NULL {noformat} When reading: {code:java} spark.read.option("nullValue", "NULL").csv(path).show() {code} The parsed dataframe is shown as: ||make||comment|| |tesla|null| We can find that null columns in dataframe can be saved as NULL strings in csv files and NULL strings in csv files can be parsed as columns of null values in dataframe. That is: {noformat} When writing, convert null(in dataframe) to
[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Description: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR [10766|https://github.com/apache/spark/pull/10766] . According to the features description in spark-csv readme file, the nullValue option is designed as: {noformat} When reading files: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files: nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string. {noformat} For example, when writing: {code:scala} Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code} The saved csv file is shown as: {noformat} Tesla,NULL {noformat} When reading: {code:java} spark.read.option("nullValue", "NULL").csv(path).show() {code} The parsed dataframe is shown as: ||make||comment|| |tesla|null| We can find that null columns in dataframe can be saved as NULL strings in csv files and NULL strings in csv files can be parsed as columns of null values in dataframe. That is: {noformat} When writing, convert null(in dataframe) to nullValue(in csv) When reading, convert nullValue or nothing(in csv) to null(in dataframe) {noformat} But actually, the option nullValue in component univocity's {*}_CommonSettings_{*}, is designed as that: {noformat} when reading, if the parser does not read any character from the input, the nullValue is used instead of an empty string when writing, if the writer has a null object to write to the output, the nullValue is used instead of an empty string{noformat} There is a difference when reading. In univocity, nothing would be convert to nullValue strings. 
But In Spark, we finally convert nothing or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* method: {code:java} private def nullSafeDatum( datum: String, name: String, nullable: Boolean, options: CSVOptions)(converter: ValueConverter): Any = { if (datum == options.nullValue || datum == null) { if (!nullable) { throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name) } null } else { converter.apply(datum) } } {code} Since Spark 2.4, for empty strings, there are emptyValueInRead for reading and emptyValueInWrite for writing that can be set in CSVOptions: {code:scala} // For writing, convert: ""(dataframe) => emptyValueInWrite(csv) // For reading, convert: "" (csv) => emptyValueInRead(dataframe){code} I think the read handling is not suitable, we can not convert "" or `{color:#172b4d}emptyValueInWrite`{color} values as ""(real empty strings) but get {color:#172b4d}emptyValueInRead's setting value actually{color}, it supposed to be as flows: {code:scala} // For reading, convert: "" or emptyValueInRead (csv) => ""(dataframe){code} For example, a column has empty strings, if emptyValueInWrite is set to "EMPTY" string. {code:scala} Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code} The saved csv file is shown as: {noformat} Tesla,EMPTY {noformat} and if we read this csv file with emptyValue(emptyValueInRead) set to "EMPTY" string. {code:java} spark.read.option("emptyValue", "EMPTY").csv(path).show() {code} we actually get the DataFrame which data is shown as: ||make||comment|| |tesla|EMPTY| but the DataFrame which data should be shown as below as expected: ||make||comment|| |tesla| | {color:#de350b}*We can not recovery it to the original DataFrame.*{color} was: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue [SPARK-12833|https://issues.apache.org/jira/browse/SPARK-12833] and PR [10766|https://github.com/apache/spark/pull/10766] . 
According to databricks spark-csv's features description in readme file, the nullValue option is designed as: {noformat} When reading files the API accepts several options: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files the API accepts several options: nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string. {noformat} For null values, the parameter nullValue can be set when reading or writing in CSVOptions: {code:scala} // For writing, convert: null(dataframe) => nullValue(csv) // For reading, convert: nullValue or ,,(csv) => null(dataframe) {code} For example, a column has null values, if nullValue is set to "null" string. {code:scala} Seq(("Tesla", null.asInstanceOf[String])).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code} The saved csv file is shown as:
[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Description: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue [SPARK-12833|https://issues.apache.org/jira/browse/SPARK-12833] and PR [10766|https://github.com/apache/spark/pull/10766] . According to databricks spark-csv's features description in readme file, the nullValue option is designed as: {noformat} When reading files the API accepts several options: nullValue: specifies a string that indicates a null value, any fields matching this string will be set as nulls in the DataFrame When writing files the API accepts several options: nullValue: specifies a string that indicates a null value, nulls in the DataFrame will be written as this string. {noformat} For null values, the parameter nullValue can be set when reading or writing in CSVOptions: {code:scala} // For writing, convert: null(dataframe) => nullValue(csv) // For reading, convert: nullValue or ,,(csv) => null(dataframe) {code} For example, a column has null values, if nullValue is set to "null" string. {code:scala} Seq(("Tesla", null.asInstanceOf[String])).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code} The saved csv file is shown as: {noformat} Tesla,NULL {noformat} and if we read this csv file with nullValue set to "null" string. 
{code:java} spark.read.option("nullValue", "NULL").csv(path).show() {code} we can get the DataFrame which data is same with the original shown as: ||make||comment|| |tesla|null| {color:#57d9a3}*We can succeed to recovery it to the original DataFrame.*{color} Since Spark 2.4, for empty strings, there are emptyValueInRead for reading and emptyValueInWrite for writing that can be set in CSVOptions: {code:scala} // For writing, convert: ""(dataframe) => emptyValueInWrite(csv) // For reading, convert: "" (csv) => emptyValueInRead(dataframe){code} I think the read handling is not suitable, we can not convert "" or `{color:#172b4d}emptyValueInWrite`{color} values as ""(real empty strings) but get {color:#172b4d}emptyValueInRead's setting value actually{color}, it supposed to be as flows: {code:scala} // For reading, convert: "" or emptyValueInRead (csv) => ""(dataframe){code} For example, a column has empty strings, if emptyValueInWrite is set to "EMPTY" string. {code:scala} Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code} The saved csv file is shown as: {noformat} Tesla,EMPTY {noformat} and if we read this csv file with emptyValue(emptyValueInRead) set to "EMPTY" string. {code:java} spark.read.option("emptyValue", "EMPTY").csv(path).show() {code} we actually get the DataFrame which data is shown as: ||make||comment|| |tesla|EMPTY| but the DataFrame which data should be shown as below as expected: ||make||comment|| |tesla| | {color:#de350b}*We can not recovery it to the original DataFrame.*{color} was: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue [SPARK-12833|https://issues.apache.org/jira/browse/SPARK-12833] and PR [10766|https://github.com/apache/spark/pull/10766] . 
In databricks spark-csv, For null values, the parameter nullValue can be set when reading or writing in CSVOptions: {code:scala} // For writing, convert: null(dataframe) => nullValue(csv) // For reading, convert: nullValue or ,,(csv) => null(dataframe) {code} For example, a column has null values, if nullValue is set to "null" string. {code:scala} Seq(("Tesla", null.asInstanceOf[String])).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code} The saved csv file is shown as: {noformat} Tesla,NULL {noformat} and if we read this csv file with nullValue set to "null" string. {code:java} spark.read.option("nullValue", "NULL").csv(path).show() {code} we can get the DataFrame which data is same with the original shown as: ||make||comment|| |tesla|null| {color:#57d9a3}*We can succeed to recovery it to the original DataFrame.*{color} Since Spark 2.4, for empty strings, there are emptyValueInRead for reading and emptyValueInWrite for writing that can be set in CSVOptions: {code:scala} // For writing, convert: ""(dataframe) => emptyValueInWrite(csv) // For reading, convert: "" (csv) => emptyValueInRead(dataframe){code} I think the read handling is not suitable, we can not convert "" or `{color:#172b4d}emptyValueInWrite`{color} values as ""(real empty strings) but get {color:#172b4d}emptyValueInRead's setting value actually{color}, it supposed to be as flows: {code:scala} // For reading, convert: "" or emptyValueInRead (csv) => ""(dataframe){code} For example, a column has empty strings, if emptyValueInWrite is set to "EMPTY" string. {code:scala} Seq(("Tesla", "")).toDF("make",
[jira] [Updated] (SPARK-37604) The parameter emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading
[ https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Guo updated SPARK-37604: Description: The csv data format is imported from databricks [spark-csv|https://github.com/databricks/spark-csv] by issue [SPARK-12833|https://issues.apache.org/jira/browse/SPARK-12833] and PR [10766|https://github.com/apache/spark/pull/10766] . In databricks spark-csv, For null values, the parameter nullValue can be set when reading or writing in CSVOptions: {code:scala} // For writing, convert: null(dataframe) => nullValue(csv) // For reading, convert: nullValue or ,,(csv) => null(dataframe) {code} For example, a column has null values, if nullValue is set to "null" string. {code:scala} Seq(("Tesla", null.asInstanceOf[String])).toDF("make", "comment").write.option("nullValue", "NULL").csv(path){code} The saved csv file is shown as: {noformat} Tesla,NULL {noformat} and if we read this csv file with nullValue set to "null" string. {code:java} spark.read.option("nullValue", "NULL").csv(path).show() {code} we can get the DataFrame which data is same with the original shown as: ||make||comment|| |tesla|null| {color:#57d9a3}*We can succeed to recovery it to the original DataFrame.*{color} Since Spark 2.4, for empty strings, there are emptyValueInRead for reading and emptyValueInWrite for writing that can be set in CSVOptions: {code:scala} // For writing, convert: ""(dataframe) => emptyValueInWrite(csv) // For reading, convert: "" (csv) => emptyValueInRead(dataframe){code} I think the read handling is not suitable, we can not convert "" or `{color:#172b4d}emptyValueInWrite`{color} values as ""(real empty strings) but get {color:#172b4d}emptyValueInRead's setting value actually{color}, it supposed to be as flows: {code:scala} // For reading, convert: "" or emptyValueInRead (csv) => ""(dataframe){code} For example, a column has empty strings, if emptyValueInWrite is set to "EMPTY" string. 
{code:scala}
Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path)
{code}
The saved csv file is:
{noformat}
Tesla,EMPTY
{noformat}
If we read this csv file with emptyValue (emptyValueInRead) set to the string "EMPTY":
{code:scala}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
we actually get this DataFrame:
||make||comment||
|Tesla|EMPTY|

but the expected DataFrame is:
||make||comment||
|Tesla| |

{color:#de350b}*The original DataFrame cannot be recovered.*{color}

was:
Csv data format is

For null values, the parameter nullValue can be set in CSVOptions when reading or writing:
{code:scala}
// For writing, convert: null(dataframe) => nullValue(csv)
// For reading, convert: nullValue or ,,(csv) => null(dataframe)
{code}
For example, suppose a column has null values and nullValue is set to the string "NULL":
{code:scala}
Seq(("Tesla", null.asInstanceOf[String])).toDF("make", "comment").write.option("nullValue", "NULL").csv(path)
{code}
The saved csv file is:
{noformat}
Tesla,NULL
{noformat}
If we read this csv file with nullValue set to the string "NULL":
{code:scala}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
we get a DataFrame whose data is the same as the original:
||make||comment||
|Tesla|null|

{color:#57d9a3}*The original DataFrame is recovered successfully.*{color}

Since Spark 2.4, for empty strings, emptyValueInRead (for reading) and emptyValueInWrite (for writing) can be set in CSVOptions:
{code:scala}
// For writing, convert: ""(dataframe) => emptyValueInWrite(csv)
// For reading, convert: "" (csv) => emptyValueInRead(dataframe)
{code}
I think the read handling is unsuitable: "" or emptyValueInWrite values in the csv are not converted back to "" (real empty strings); instead we get whatever value emptyValueInRead is set to. It is supposed to be as follows:
{code:scala}
// For reading, convert: "" or emptyValueInRead (csv) => ""(dataframe)
{code}
For example, suppose a column has empty strings and emptyValueInWrite is set to the string "EMPTY":
{code:scala}
Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path)
{code}
The saved csv file is:
{noformat}
Tesla,EMPTY
{noformat}
If we read this csv file with emptyValue (emptyValueInRead) set to the string "EMPTY":
{code:scala}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
we actually get this DataFrame:
||make||comment||
|Tesla|EMPTY|

but the expected DataFrame is:
||make||comment||
|Tesla| |

{color:#de350b}*The original DataFrame cannot be recovered.*{color}

> The parameter emptyValueInRead(in CSVOptions) is suggested to be designed as
> that any fields matching this string will be set as empty values "" when
> reading
>