[jira] [Updated] (SPARK-48691) Upgrade `scalatest` related dependencies to the 3.2.18 series

2024-06-23 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48691:

Summary: Upgrade `scalatest` related dependencies to the 3.2.18 series  
(was: Upgrade `mockito` to 5.12.0)

> Upgrade `scalatest` related dependencies to the 3.2.18 series
> -
>
> Key: SPARK-48691
> URL: https://issues.apache.org/jira/browse/SPARK-48691
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>







[jira] [Comment Edited] (SPARK-48689) Reading lengthy JSON results in a corrupted record.

2024-06-22 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856910#comment-17856910
 ] 

Wei Guo edited comment on SPARK-48689 at 6/22/24 2:33 PM:
--

This is controlled by the `maxStringLen` option, whose default value is 
20,000,000. If you set the option to a large enough value when reading, you get 
the right result.
{code:java}
spark.read.option("maxStringLen", 1).json(path){code}
 

I did a test with a long string of length 20,000,010, which proves that:

!image-2024-06-22-15-33-38-833.png!

 


was (Author: wayne guo):
This is controlled by the `maxStringLen` option, whose default value is 
20,000,000. If you set the option when reading, you can get the right result.
{code:java}
spark.read.option("maxStringLen", 1).json(path){code}
 

I did a test with a long string of length 20,000,010, which proves that:

!image-2024-06-22-15-33-38-833.png!

 

> Reading lengthy JSON results in a corrupted record.
> ---
>
> Key: SPARK-48689
> URL: https://issues.apache.org/jira/browse/SPARK-48689
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
> Environment: Ubuntu 22.04, Python 3.11, and OpenJDK 22
>Reporter: Yuxiang Wei
>Priority: Major
>  Labels: Reader
> Attachments: image-2024-06-22-15-33-38-833.png
>
>
> When reading a DataFrame from a JSON file that includes a very long string, 
> Spark incorrectly treats the row as a corrupted record even though the format 
> is correct. Here is a minimal example with PySpark:
> {code:python}
> import json
> import tempfile
> from pyspark.sql import SparkSession
> 
> # Create a Spark session
> spark = (SparkSession.builder
>     .appName("PySpark JSON Example")
>     .getOrCreate()
> )
> 
> # Define the JSON content (the string must exceed 20,000,000 characters,
> # the default maxStringLen, for the issue to show up)
> data = {
>     "text": "a" * 1
> }
> 
> # Create a temporary file
> with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
>     # Write the JSON content to the temporary file
>     tmp_file.write(json.dumps(data) + "\n")
>     tmp_file_path = tmp_file.name
> 
>     # Load the JSON file into a PySpark DataFrame
>     df = spark.read.json(tmp_file_path)
> 
>     # Print the schema
>     print(df)
> {code}






[jira] [Comment Edited] (SPARK-48689) Reading lengthy JSON results in a corrupted record.

2024-06-22 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856910#comment-17856910
 ] 

Wei Guo edited comment on SPARK-48689 at 6/22/24 7:36 AM:
--

This is controlled by the `maxStringLen` option, whose default value is 
20,000,000. If you set the option when reading, you can get the right result.
{code:java}
spark.read.option("maxStringLen", 1).json(path){code}
 

I did a test with a long string of length 20,000,010, which proves that:

!image-2024-06-22-15-33-38-833.png!

 


was (Author: wayne guo):
This is controlled by the `maxStringLen` option, whose default value is 
20,000,000. If you set the option when reading, you can get the right result.
{code:java}
spark.read.option("maxStringLen", 1).json(path){code}
 

I ran a test with a string of length 20,000,010:

!image-2024-06-22-15-33-38-833.png!

 

> Reading lengthy JSON results in a corrupted record.
> ---
>
> Key: SPARK-48689
> URL: https://issues.apache.org/jira/browse/SPARK-48689
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
> Environment: Ubuntu 22.04, Python 3.11, and OpenJDK 22
>Reporter: Yuxiang Wei
>Priority: Major
>  Labels: Reader
> Attachments: image-2024-06-22-15-33-38-833.png
>
>
> When reading a DataFrame from a JSON file that includes a very long string, 
> Spark incorrectly treats the row as a corrupted record even though the format 
> is correct. Here is a minimal example with PySpark:
> {code:python}
> import json
> import tempfile
> from pyspark.sql import SparkSession
> 
> # Create a Spark session
> spark = (SparkSession.builder
>     .appName("PySpark JSON Example")
>     .getOrCreate()
> )
> 
> # Define the JSON content (the string must exceed 20,000,000 characters,
> # the default maxStringLen, for the issue to show up)
> data = {
>     "text": "a" * 1
> }
> 
> # Create a temporary file
> with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
>     # Write the JSON content to the temporary file
>     tmp_file.write(json.dumps(data) + "\n")
>     tmp_file_path = tmp_file.name
> 
>     # Load the JSON file into a PySpark DataFrame
>     df = spark.read.json(tmp_file_path)
> 
>     # Print the schema
>     print(df)
> {code}






[jira] [Updated] (SPARK-48689) Reading lengthy JSON results in a corrupted record.

2024-06-22 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48689:

Attachment: image-2024-06-22-15-33-38-833.png

> Reading lengthy JSON results in a corrupted record.
> ---
>
> Key: SPARK-48689
> URL: https://issues.apache.org/jira/browse/SPARK-48689
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
> Environment: Ubuntu 22.04, Python 3.11, and OpenJDK 22
>Reporter: Yuxiang Wei
>Priority: Major
>  Labels: Reader
> Attachments: image-2024-06-22-15-33-38-833.png
>
>
> When reading a DataFrame from a JSON file that includes a very long string, 
> Spark incorrectly treats the row as a corrupted record even though the format 
> is correct. Here is a minimal example with PySpark:
> {code:python}
> import json
> import tempfile
> from pyspark.sql import SparkSession
> 
> # Create a Spark session
> spark = (SparkSession.builder
>     .appName("PySpark JSON Example")
>     .getOrCreate()
> )
> 
> # Define the JSON content (the string must exceed 20,000,000 characters,
> # the default maxStringLen, for the issue to show up)
> data = {
>     "text": "a" * 1
> }
> 
> # Create a temporary file
> with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
>     # Write the JSON content to the temporary file
>     tmp_file.write(json.dumps(data) + "\n")
>     tmp_file_path = tmp_file.name
> 
>     # Load the JSON file into a PySpark DataFrame
>     df = spark.read.json(tmp_file_path)
> 
>     # Print the schema
>     print(df)
> {code}






[jira] [Commented] (SPARK-48689) Reading lengthy JSON results in a corrupted record.

2024-06-22 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856910#comment-17856910
 ] 

Wei Guo commented on SPARK-48689:
-

This is controlled by the `maxStringLen` option, whose default value is 
20,000,000. If you set the option when reading, you can get the right result.
{code:java}
spark.read.option("maxStringLen", 1).json(path){code}
 

I ran a test with a string of length 20,000,010:

!image-2024-06-22-15-33-38-833.png!

 

> Reading lengthy JSON results in a corrupted record.
> ---
>
> Key: SPARK-48689
> URL: https://issues.apache.org/jira/browse/SPARK-48689
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.1
> Environment: Ubuntu 22.04, Python 3.11, and OpenJDK 22
>Reporter: Yuxiang Wei
>Priority: Major
>  Labels: Reader
> Attachments: image-2024-06-22-15-33-38-833.png
>
>
> When reading a DataFrame from a JSON file that includes a very long string, 
> Spark incorrectly treats the row as a corrupted record even though the format 
> is correct. Here is a minimal example with PySpark:
> {code:python}
> import json
> import tempfile
> from pyspark.sql import SparkSession
> 
> # Create a Spark session
> spark = (SparkSession.builder
>     .appName("PySpark JSON Example")
>     .getOrCreate()
> )
> 
> # Define the JSON content (the string must exceed 20,000,000 characters,
> # the default maxStringLen, for the issue to show up)
> data = {
>     "text": "a" * 1
> }
> 
> # Create a temporary file
> with tempfile.NamedTemporaryFile(delete=False, suffix=".json", mode="w") as tmp_file:
>     # Write the JSON content to the temporary file
>     tmp_file.write(json.dumps(data) + "\n")
>     tmp_file_path = tmp_file.name
> 
>     # Load the JSON file into a PySpark DataFrame
>     df = spark.read.json(tmp_file_path)
> 
>     # Print the schema
>     print(df)
> {code}






[jira] [Created] (SPARK-48671) Add test cases for Hex.hex

2024-06-20 Thread Wei Guo (Jira)
Wei Guo created SPARK-48671:
---

 Summary: Add test cases for Hex.hex
 Key: SPARK-48671
 URL: https://issues.apache.org/jira/browse/SPARK-48671
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo









[jira] [Commented] (SPARK-48660) The result of explain is incorrect for CreateTableAsSelect

2024-06-18 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856122#comment-17856122
 ] 

Wei Guo commented on SPARK-48660:
-

I am working on this. Thank you for the recommendation, [~yangjie01].

> The result of explain is incorrect for CreateTableAsSelect
> --
>
> Key: SPARK-48660
> URL: https://issues.apache.org/jira/browse/SPARK-48660
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:sql}
> CREATE TABLE order_history_version_audit_rno (
>   eventid STRING,
>   id STRING,
>   referenceid STRING,
>   type STRING,
>   referencetype STRING,
>   sellerid BIGINT,
>   buyerid BIGINT,
>   producerid STRING,
>   versionid INT,
>   changedocuments ARRAY BIGINT, changeDetails: STRING>>,
>   dt STRING,
>   hr STRING)
> USING parquet
> PARTITIONED BY (dt, hr);
> explain cost
> CREATE TABLE order_history_version_audit_rno
> USING parquet
> PARTITIONED BY (dt)
> CLUSTERED BY (id) INTO 1000 buckets
> AS SELECT * FROM order_history_version_audit_rno
> WHERE dt >= '2023-11-29';
> {code}
> {noformat}
> spark-sql (default)> 
>> explain cost
>> CREATE TABLE order_history_version_audit_rno
>> USING parquet
>> PARTITIONED BY (dt)
>> CLUSTERED BY (id) INTO 1000 buckets
>> AS SELECT * FROM order_history_version_audit_rno
>> WHERE dt >= '2023-11-29';
> == Optimized Logical Plan ==
> CreateDataSourceTableAsSelectCommand 
> `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, 
> [eventid, id, referenceid, type, referencetype, sellerid, buyerid, 
> producerid, versionid, changedocuments, hr, dt]
>+- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
> sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
> hr#16, dt#15]
>   +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
> sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
> dt#15, hr#16]
>  +- Filter (dt#15 >= 2023-11-29)
> +- SubqueryAlias 
> spark_catalog.default.order_history_version_audit_rno
>+- Relation 
> spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16]
>  parquet
> == Physical Plan ==
> Execute CreateDataSourceTableAsSelectCommand
>+- CreateDataSourceTableAsSelectCommand 
> `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, 
> [eventid, id, referenceid, type, referencetype, sellerid, buyerid, 
> producerid, versionid, changedocuments, hr, dt]
>  +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
> sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
> hr#16, dt#15]
> +- Project [eventid#5, id#6, referenceid#7, type#8, 
> referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, 
> changedocuments#14, dt#15, hr#16]
>+- Filter (dt#15 >= 2023-11-29)
>   +- SubqueryAlias 
> spark_catalog.default.order_history_version_audit_rno
>  +- Relation 
> spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16]
>  parquet
> {noformat}






[jira] [Comment Edited] (SPARK-48660) The result of explain is incorrect for CreateTableAsSelect

2024-06-18 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856122#comment-17856122
 ] 

Wei Guo edited comment on SPARK-48660 at 6/19/24 4:18 AM:
--

I am working on this. Thank you for the recommendation, [~LuciferYang].


was (Author: wayne guo):
I am working on this. Thank you for the recommendation, [~yangjie01].

> The result of explain is incorrect for CreateTableAsSelect
> --
>
> Key: SPARK-48660
> URL: https://issues.apache.org/jira/browse/SPARK-48660
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:sql}
> CREATE TABLE order_history_version_audit_rno (
>   eventid STRING,
>   id STRING,
>   referenceid STRING,
>   type STRING,
>   referencetype STRING,
>   sellerid BIGINT,
>   buyerid BIGINT,
>   producerid STRING,
>   versionid INT,
>   changedocuments ARRAY BIGINT, changeDetails: STRING>>,
>   dt STRING,
>   hr STRING)
> USING parquet
> PARTITIONED BY (dt, hr);
> explain cost
> CREATE TABLE order_history_version_audit_rno
> USING parquet
> PARTITIONED BY (dt)
> CLUSTERED BY (id) INTO 1000 buckets
> AS SELECT * FROM order_history_version_audit_rno
> WHERE dt >= '2023-11-29';
> {code}
> {noformat}
> spark-sql (default)> 
>> explain cost
>> CREATE TABLE order_history_version_audit_rno
>> USING parquet
>> PARTITIONED BY (dt)
>> CLUSTERED BY (id) INTO 1000 buckets
>> AS SELECT * FROM order_history_version_audit_rno
>> WHERE dt >= '2023-11-29';
> == Optimized Logical Plan ==
> CreateDataSourceTableAsSelectCommand 
> `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, 
> [eventid, id, referenceid, type, referencetype, sellerid, buyerid, 
> producerid, versionid, changedocuments, hr, dt]
>+- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
> sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
> hr#16, dt#15]
>   +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
> sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
> dt#15, hr#16]
>  +- Filter (dt#15 >= 2023-11-29)
> +- SubqueryAlias 
> spark_catalog.default.order_history_version_audit_rno
>+- Relation 
> spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16]
>  parquet
> == Physical Plan ==
> Execute CreateDataSourceTableAsSelectCommand
>+- CreateDataSourceTableAsSelectCommand 
> `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, 
> [eventid, id, referenceid, type, referencetype, sellerid, buyerid, 
> producerid, versionid, changedocuments, hr, dt]
>  +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
> sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
> hr#16, dt#15]
> +- Project [eventid#5, id#6, referenceid#7, type#8, 
> referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, 
> changedocuments#14, dt#15, hr#16]
>+- Filter (dt#15 >= 2023-11-29)
>   +- SubqueryAlias 
> spark_catalog.default.order_history_version_audit_rno
>  +- Relation 
> spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16]
>  parquet
> {noformat}






[jira] [Created] (SPARK-48661) Upgrade RoaringBitmap to 1.1.0

2024-06-18 Thread Wei Guo (Jira)
Wei Guo created SPARK-48661:
---

 Summary: Upgrade RoaringBitmap to 1.1.0
 Key: SPARK-48661
 URL: https://issues.apache.org/jira/browse/SPARK-48661
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 4.0.0
Reporter: Wei Guo









[jira] [Updated] (SPARK-48635) Assign classes to join type errors and as-of join error

2024-06-14 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48635:

Summary:  Assign classes to join type errors  and as-of join error  (was:  
Assign classes to join type errors  and as-of join error 
_LEGACY_ERROR_TEMP_3217 )

>  Assign classes to join type errors  and as-of join error
> -
>
> Key: SPARK-48635
> URL: https://issues.apache.org/jira/browse/SPARK-48635
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Minor
>
> join type errors: 
> LEGACY_ERROR_TEMP[1319, 3216]
> as-of join error:
> _LEGACY_ERROR_TEMP_3217






[jira] [Updated] (SPARK-48635) Assign classes to join type errors and as-of join error _LEGACY_ERROR_TEMP_3217

2024-06-14 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48635:

Summary:  Assign classes to join type errors  and as-of join error 
_LEGACY_ERROR_TEMP_3217   (was:  Assign classes to join type errors 
_LEGACY_ERROR_TEMP_[1319, 3216] and as-of join error _LEGACY_ERROR_TEMP_3217 )

>  Assign classes to join type errors  and as-of join error 
> _LEGACY_ERROR_TEMP_3217 
> --
>
> Key: SPARK-48635
> URL: https://issues.apache.org/jira/browse/SPARK-48635
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Minor
>







[jira] [Updated] (SPARK-48635) Assign classes to join type errors and as-of join error _LEGACY_ERROR_TEMP_3217

2024-06-14 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48635:

Description: 
join type errors: 
LEGACY_ERROR_TEMP[1319, 3216]
as-of join error:
_LEGACY_ERROR_TEMP_3217

  was:
join type errors: 
_LEGACY_ERROR_TEMP_[1319, 3216]
as-of join error:
_LEGACY_ERROR_TEMP_3217


>  Assign classes to join type errors  and as-of join error 
> _LEGACY_ERROR_TEMP_3217 
> --
>
> Key: SPARK-48635
> URL: https://issues.apache.org/jira/browse/SPARK-48635
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Minor
>
> join type errors: 
> LEGACY_ERROR_TEMP[1319, 3216]
> as-of join error:
> _LEGACY_ERROR_TEMP_3217






[jira] [Updated] (SPARK-48635) Assign classes to join type errors and as-of join error _LEGACY_ERROR_TEMP_3217

2024-06-14 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48635:

Description: 
join type errors: 
_LEGACY_ERROR_TEMP_[1319, 3216]
as-of join error:


>  Assign classes to join type errors  and as-of join error 
> _LEGACY_ERROR_TEMP_3217 
> --
>
> Key: SPARK-48635
> URL: https://issues.apache.org/jira/browse/SPARK-48635
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Minor
>
> join type errors: 
> _LEGACY_ERROR_TEMP_[1319, 3216]
> as-of join error:






[jira] [Updated] (SPARK-48635) Assign classes to join type errors and as-of join error _LEGACY_ERROR_TEMP_3217

2024-06-14 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48635:

Description: 
join type errors: 
_LEGACY_ERROR_TEMP_[1319, 3216]
as-of join error:
_LEGACY_ERROR_TEMP_3217

  was:
join type errors: 
_LEGACY_ERROR_TEMP_[1319, 3216]
as-of join error:



>  Assign classes to join type errors  and as-of join error 
> _LEGACY_ERROR_TEMP_3217 
> --
>
> Key: SPARK-48635
> URL: https://issues.apache.org/jira/browse/SPARK-48635
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Minor
>
> join type errors: 
> _LEGACY_ERROR_TEMP_[1319, 3216]
> as-of join error:
> _LEGACY_ERROR_TEMP_3217






[jira] [Created] (SPARK-48635) Assign classes to join type errors _LEGACY_ERROR_TEMP_[1319, 3216] and as-of join error _LEGACY_ERROR_TEMP_3217

2024-06-14 Thread Wei Guo (Jira)
Wei Guo created SPARK-48635:
---

 Summary:  Assign classes to join type errors 
_LEGACY_ERROR_TEMP_[1319, 3216] and as-of join error _LEGACY_ERROR_TEMP_3217 
 Key: SPARK-48635
 URL: https://issues.apache.org/jira/browse/SPARK-48635
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo









[jira] [Updated] (SPARK-48614) Cleanup deprecated api usage related to kafka-clients

2024-06-13 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48614:

Description: (was: There are some deprecated classes and methods in 
commons-io called in Spark, we need to replace them:
 * writeStringToFile(final File file, final String data)
 * CountingInputStream)

> Cleanup deprecated api usage related to kafka-clients
> -
>
> Key: SPARK-48614
> URL: https://issues.apache.org/jira/browse/SPARK-48614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Assignee: Wei Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Created] (SPARK-48614) Cleanup deprecated api usage related to kafka-clients

2024-06-13 Thread Wei Guo (Jira)
Wei Guo created SPARK-48614:
---

 Summary: Cleanup deprecated api usage related to kafka-clients
 Key: SPARK-48614
 URL: https://issues.apache.org/jira/browse/SPARK-48614
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo
Assignee: Wei Guo
 Fix For: 4.0.0


There are some deprecated classes and methods in commons-io called in Spark, we 
need to replace them:
 * writeStringToFile(final File file, final String data)
 * CountingInputStream






[jira] [Created] (SPARK-48604) Replace deprecated classes and methods of arrow-vector called in Spark

2024-06-12 Thread Wei Guo (Jira)
Wei Guo created SPARK-48604:
---

 Summary: Replace deprecated classes and methods of arrow-vector 
called in Spark
 Key: SPARK-48604
 URL: https://issues.apache.org/jira/browse/SPARK-48604
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo


There are some deprecated classes and methods in commons-io called in Spark, we 
need to replace them:
 * writeStringToFile(final File file, final String data)
 * CountingInputStream






[jira] [Updated] (SPARK-48604) Replace deprecated classes and methods of arrow-vector called in Spark

2024-06-12 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48604:

Description: 
There are some deprecated classes and methods in arrow-vector called in Spark, 
we need to replace them:
 * ArrowType.Decimal(precision, scale)

  was:
There are some deprecated classes and methods in commons-io called in Spark, we 
need to replace them:
 * writeStringToFile(final File file, final String data)
 * CountingInputStream


> Replace deprecated classes and methods of arrow-vector called in Spark
> --
>
> Key: SPARK-48604
> URL: https://issues.apache.org/jira/browse/SPARK-48604
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>
> There are some deprecated classes and methods in arrow-vector called in 
> Spark, we need to replace them:
>  * ArrowType.Decimal(precision, scale)
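For reference only, a sketch of the replacement (assuming the non-deprecated three-argument constructor of Arrow's {{ArrowType.Decimal}}, which takes the bit width explicitly; the precision and scale are illustrative):
{code:scala}
import org.apache.arrow.vector.types.pojo.ArrowType

val precision = 38   // illustrative values
val scale = 18

// Deprecated two-argument constructor:
val oldType = new ArrowType.Decimal(precision, scale)

// Non-deprecated form: pass the bit width explicitly (128 bits here).
val newType = new ArrowType.Decimal(precision, scale, 128)
{code}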






[jira] [Updated] (SPARK-48583) Replace deprecated classes and methods of commons-io called in Spark

2024-06-12 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48583:

Summary: Replace deprecated classes and methods of commons-io called in 
Spark  (was: Replace deprecated classes and methods of `commons-io` called in 
Spark)

> Replace deprecated classes and methods of commons-io called in Spark
> 
>
> Key: SPARK-48583
> URL: https://issues.apache.org/jira/browse/SPARK-48583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>
> There are some deprecated classes and methods in commons-io called in Spark, 
> we need to replace them:
>  * writeStringToFile(final File file, final String data)
>  * CountingInputStream






[jira] [Updated] (SPARK-48583) Replace deprecated classes and methods of `commons-io` called in Spark

2024-06-12 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48583:

Description: 
There are some deprecated classes and methods in commons-io called in Spark, we 
need to replace them:
 * writeStringToFile(final File file, final String data)
 * CountingInputStream

  was:
There are some deprecated classes and methods in `commons-io` called in Spark, 
we need to replace them:
 * `writeStringToFile(final File file, final String data);
 * `CountingInputStream`


> Replace deprecated classes and methods of `commons-io` called in Spark
> --
>
> Key: SPARK-48583
> URL: https://issues.apache.org/jira/browse/SPARK-48583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>
> There are some deprecated classes and methods in commons-io called in Spark, 
> we need to replace them:
>  * writeStringToFile(final File file, final String data)
>  * CountingInputStream






[jira] [Updated] (SPARK-48583) Replace deprecated classes and methods of `commons-io` called in Spark

2024-06-12 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48583:

Description: 
There are some deprecated classes and methods in `commons-io` called in Spark, 
we need to replace them:
 *   `writeStringToFile(final File file, final String data);
 * `CountingInputStream`

  was:Method `writeStringToFile(final File file, final String data)` in class 
`FileUtils` is deprecated, use `writeStringToFile(final File file, final String 
data, final Charset charset)` instead in UDFXPathUtilSuite.


> Replace deprecated classes and methods of `commons-io` called in Spark
> --
>
> Key: SPARK-48583
> URL: https://issues.apache.org/jira/browse/SPARK-48583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>
> There are some deprecated classes and methods in `commons-io` called in 
> Spark, we need to replace them:
>  *   `writeStringToFile(final File file, final String data);
>  * `CountingInputStream`






[jira] [Updated] (SPARK-48583) Replace deprecated classes and methods of `commons-io` called in Spark

2024-06-12 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48583:

Description: 
There are some deprecated classes and methods in `commons-io` called in Spark, 
we need to replace them:
 * `writeStringToFile(final File file, final String data);
 * `CountingInputStream`

  was:
There are some deprecated classes and methods in `commons-io` called in Spark, 
we need to replace them:
 *   `writeStringToFile(final File file, final String data);
 * `CountingInputStream`


> Replace deprecated classes and methods of `commons-io` called in Spark
> --
>
> Key: SPARK-48583
> URL: https://issues.apache.org/jira/browse/SPARK-48583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>
> There are some deprecated classes and methods in `commons-io` called in 
> Spark, we need to replace them:
>  * `writeStringToFile(final File file, final String data);
>  * `CountingInputStream`






[jira] [Updated] (SPARK-48583) Replace deprecated classes and methods of `commons-io` called in Spark

2024-06-12 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48583:

Summary: Replace deprecated classes and methods of `commons-io` called in 
Spark  (was: Replace deprecated `FileUtils#writeStringToFile` )

> Replace deprecated classes and methods of `commons-io` called in Spark
> --
>
> Key: SPARK-48583
> URL: https://issues.apache.org/jira/browse/SPARK-48583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Major
>  Labels: pull-request-available
>
> Method `writeStringToFile(final File file, final String data)` in class 
> `FileUtils` is deprecated, use `writeStringToFile(final File file, final 
> String data, final Charset charset)` instead in UDFXPathUtilSuite.






[jira] [Created] (SPARK-48583) Replace deprecated `FileUtils#writeStringToFile`

2024-06-11 Thread Wei Guo (Jira)
Wei Guo created SPARK-48583:
---

 Summary: Replace deprecated `FileUtils#writeStringToFile` 
 Key: SPARK-48583
 URL: https://issues.apache.org/jira/browse/SPARK-48583
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wei Guo


Method `writeStringToFile(final File file, final String data)` in class 
`FileUtils` is deprecated, use `writeStringToFile(final File file, final String 
data, final Charset charset)` instead in UDFXPathUtilSuite.
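A rough sketch of the replacement (assuming UTF-8 is the charset the test actually intends; the file and data below are illustrative):
{code:scala}
import java.io.File
import java.nio.charset.StandardCharsets
import org.apache.commons.io.FileUtils

val file = new File("/tmp/example.xml")   // illustrative file
val data = "<root/>"                      // illustrative content

// Deprecated overload: relies on the default platform charset.
FileUtils.writeStringToFile(file, data)

// Replacement: pass the charset explicitly.
FileUtils.writeStringToFile(file, data, StandardCharsets.UTF_8)
{code}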






[jira] [Updated] (SPARK-48581) Upgrade dropwizard metrics to 4.2.26

2024-06-10 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-48581:

Summary: Upgrade dropwizard metrics to 4.2.26  (was: Upgrade dropwizard 
metrics 4.2.26)

> Upgrade dropwizard metrics to 4.2.26
> 
>
> Key: SPARK-48581
> URL: https://issues.apache.org/jira/browse/SPARK-48581
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Wei Guo
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-48581) Upgrade dropwizard metrics 4.2.26

2024-06-10 Thread Wei Guo (Jira)
Wei Guo created SPARK-48581:
---

 Summary: Upgrade dropwizard metrics 4.2.26
 Key: SPARK-48581
 URL: https://issues.apache.org/jira/browse/SPARK-48581
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: Wei Guo









[jira] [Created] (SPARK-48539) Upgrade docker-java to 3.3.6

2024-06-05 Thread Wei Guo (Jira)
Wei Guo created SPARK-48539:
---

 Summary: Upgrade docker-java to 3.3.6
 Key: SPARK-48539
 URL: https://issues.apache.org/jira/browse/SPARK-48539
 Project: Spark
  Issue Type: Improvement
  Components: Spark Docker
Affects Versions: 4.0.0
Reporter: Wei Guo
 Fix For: 4.0.0









[jira] [Commented] (SPARK-47259) Assign classes to interval errors

2024-05-28 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850238#comment-17850238
 ] 

Wei Guo commented on SPARK-47259:
-

Update `_LEGACY_ERROR_TEMP_32[08-14]` to `_LEGACY_ERROR_TEMP_32[09-14]`, 
because `_LEGACY_ERROR_TEMP_3208` is not related to interval errors.

> Assign classes to interval errors
> -
>
> Key: SPARK-47259
> URL: https://issues.apache.org/jira/browse/SPARK-47259
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_32[09-14]* 
> defined in {*}core/src/main/resources/error/error-classes.json{*}. The name 
> should be short but complete (look at the example in error-classes.json).
> Add a test which triggers the error from user code if such test still doesn't 
> exist. Check exception fields by using {*}checkError(){*}. The last function 
> checks valuable error fields only, and avoids dependencies from error text 
> message. In this way, tech editors can modify error format in 
> error-classes.json, and don't worry of Spark's internal tests. Migrate other 
> tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using SQL query), replace 
> the error by an internal error, see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current is not 
> clear. Propose a solution to users how to avoid and fix such kind of errors.
> Please, look at the PR below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]
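As a rough illustration of the {{checkError()}} approach described above (everything specific below is a placeholder, not the actual query, error class, or parameters for these tickets; it assumes a suite that mixes in Spark's SQL test helpers):
{code:scala}
import org.apache.spark.sql.AnalysisException

// Hypothetical sketch of a migrated test: trigger the error from user code and
// assert only on the stable error fields, not on the formatted message text.
val e = intercept[AnalysisException] {
  sql("SELECT INTERVAL '1 banana'").collect()        // placeholder failing query
}
checkError(
  exception = e,
  errorClass = "SOME_INTERVAL_ERROR",                // placeholder error class name
  parameters = Map("input" -> "1 banana"))           // placeholder parameters
{code}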






[jira] [Updated] (SPARK-47259) Assign classes to interval errors

2024-05-28 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-47259:

Description: 
Choose a proper name for the error class *_LEGACY_ERROR_TEMP_32[09-14]* defined 
in {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
short but complete (look at the example in error-classes.json).

Add a test which triggers the error from user code if such test still doesn't 
exist. Check exception fields by using {*}checkError(){*}. The last function 
checks valuable error fields only, and avoids dependencies from error text 
message. In this way, tech editors can modify error format in 
error-classes.json, and don't worry of Spark's internal tests. Migrate other 
tests that might trigger the error onto checkError().

If you cannot reproduce the error from user space (using SQL query), replace 
the error by an internal error, see {*}SparkException.internalError(){*}.

Improve the error message format in error-classes.json if the current is not 
clear. Propose a solution to users how to avoid and fix such kind of errors.

Please, look at the PR below as examples:
 * [https://github.com/apache/spark/pull/38685]
 * [https://github.com/apache/spark/pull/38656]
 * [https://github.com/apache/spark/pull/38490]

  was:
Choose a proper name for the error class *_LEGACY_ERROR_TEMP_32[08-14]* defined 
in {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
short but complete (look at the example in error-classes.json).

Add a test which triggers the error from user code if such test still doesn't 
exist. Check exception fields by using {*}checkError(){*}. The last function 
checks valuable error fields only, and avoids dependencies from error text 
message. In this way, tech editors can modify error format in 
error-classes.json, and don't worry of Spark's internal tests. Migrate other 
tests that might trigger the error onto checkError().

If you cannot reproduce the error from user space (using SQL query), replace 
the error by an internal error, see {*}SparkException.internalError(){*}.

Improve the error message format in error-classes.json if the current is not 
clear. Propose a solution to users how to avoid and fix such kind of errors.

Please, look at the PR below as examples:
 * [https://github.com/apache/spark/pull/38685]
 * [https://github.com/apache/spark/pull/38656]
 * [https://github.com/apache/spark/pull/38490]


> Assign classes to interval errors
> -
>
> Key: SPARK-47259
> URL: https://issues.apache.org/jira/browse/SPARK-47259
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_32[09-14]* 
> defined in {*}core/src/main/resources/error/error-classes.json{*}. The name 
> should be short but complete (look at the example in error-classes.json).
> Add a test which triggers the error from user code if such test still doesn't 
> exist. Check exception fields by using {*}checkError(){*}. The last function 
> checks valuable error fields only, and avoids dependencies from error text 
> message. In this way, tech editors can modify error format in 
> error-classes.json, and don't worry of Spark's internal tests. Migrate other 
> tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using SQL query), replace 
> the error by an internal error, see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current is not 
> clear. Propose a solution to users how to avoid and fix such kind of errors.
> Please, look at the PR below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]






[jira] [Commented] (SPARK-40678) JSON conversion of ArrayType is not properly supported in Spark 3.2/2.13

2023-02-12 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687573#comment-17687573
 ] 

Wei Guo commented on SPARK-40678:
-

Fixed by PR 38154 https://github.com/apache/spark/pull/38154

> JSON conversion of ArrayType is not properly supported in Spark 3.2/2.13
> 
>
> Key: SPARK-40678
> URL: https://issues.apache.org/jira/browse/SPARK-40678
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.2.0
>Reporter: Cédric Chantepie
>Priority: Major
>
> In Spark 3.2 (Scala 2.13), values with {{ArrayType}} are no longer properly 
> support with JSON; e.g.
> {noformat}
> import org.apache.spark.sql.SparkSession
> case class KeyValue(key: String, value: Array[Byte])
> val spark = 
> SparkSession.builder().master("local[1]").appName("test").getOrCreate()
> import spark.implicits._
> val df = Seq(Array(KeyValue("foo", "bar".getBytes))).toDF()
> df.foreach(r => println(r.json))
> {noformat}
> Expected:
> {noformat}
> [{foo, bar}]
> {noformat}
> Encountered:
> {noformat}
> java.lang.IllegalArgumentException: Failed to convert value 
> ArraySeq([foo,[B@dcdb68f]) (class of class 
> scala.collection.mutable.ArraySeq$ofRef}) with the type of 
> ArrayType(Seq(StructField(key,StringType,false), 
> StructField(value,BinaryType,false)),true) to JSON.
>   at org.apache.spark.sql.Row.toJson$1(Row.scala:604)
>   at org.apache.spark.sql.Row.jsonValue(Row.scala:613)
>   at org.apache.spark.sql.Row.jsonValue$(Row.scala:552)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.jsonValue(rows.scala:166)
>   at org.apache.spark.sql.Row.json(Row.scala:535)
>   at org.apache.spark.sql.Row.json$(Row.scala:535)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.json(rows.scala:166)
> {noformat}






[jira] [Commented] (SPARK-39348) Create table in overwrite mode fails when interrupted

2023-02-09 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686390#comment-17686390
 ] 

Wei Guo commented on SPARK-39348:
-

After PR [https://github.com/apache/spark/pull/26559], it has been removed.
 * Since Spark 2.4, creating a managed table with a nonempty location is not 
allowed. An exception is thrown when attempting to create a managed table with 
a nonempty location. Setting 
{{spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation}} to {{true}} 
restores the previous behavior. This option will be removed in Spark 3.0.

> Create table in overwrite mode fails when interrupted
> -
>
> Key: SPARK-39348
> URL: https://issues.apache.org/jira/browse/SPARK-39348
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.1.1
>Reporter: Max
>Priority: Major
>
> When you attempt to rerun an Apache Spark write operation by cancelling the 
> currently running job, the following error occurs:
> {code:java}
> Error: org.apache.spark.sql.AnalysisException: Cannot create the managed 
> table('`testdb`.` testtable`').
> The associated location 
> ('dbfs:/user/hive/warehouse/testdb.db/metastore_cache_ testtable) already 
> exists.;{code}
> This problem can occur if:
>  * The cluster is terminated while a write operation is in progress.
>  * A temporary network issue occurs.
>  * The job is interrupted.
> You can reproduce the problem by following these steps:
> 1. Create a DataFrame:
> {code:java}
> val df = spark.range(1000){code}
> 2. Write the DataFrame to a location in overwrite mode:
> {code:java}
> df.write.mode(SaveMode.Overwrite).saveAsTable("testdb.testtable"){code}
> 3. Cancel the command while it is executing.
> 4. Re-run the {{write}} command.






[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicitly in CSV dataSource

2023-02-06 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by the CSV data source was upgraded from 2.8.3 to 2.9.0. 
This also brought in a new univocity-parsers feature that quotes values in the 
first column when they start with the comment character, which was a breaking 
change for downstream users that handle a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0, the content of the output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set the comment option to '\u0000' to keep the previous 
behavior, because of the newly added `isCommentSet` check logic:
{code:java}
val isCommentSet = this.comment != '\u0000'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it 
explicitly in CSV dataSource.
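For illustration only, a sketch of the user-facing scenario this change targets (assumes an active SparkSession named {{spark}} and {{import spark.implicits._}}; the path is illustrative):
{code:scala}
// Sketch: the comment character is set explicitly by the user, so it should be
// passed through to univocity on both the write and the read path.
val path = "/tmp/comment_test"
Seq(("#abc", 1)).toDF()
  .write
  .option("comment", "#")
  .csv(path)

val df = spark.read
  .option("comment", "#")   // lines starting with '#' are skipped on read
  .csv(path)
{code}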

 

After this change, the behavior is as follows:
|id|code|2.4 and before|3.0 and after|this update|remark|
|1|Seq("#abc", "\udef", "xyz").toDF()
.write.{color:#57d9a3}option("comment", "\u"){color}.csv(path)|#abc
*def*
xyz|{color:#4c9aff}"#abc"{color}
{color:#4c9aff}*def*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}*"def"*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|2|Seq("#abc", "\udef", "xyz").toDF()
.write{color:#57d9a3}.option("comment", "#"){color}.csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|3|Seq("#abc", "\udef", "xyz").toDF()
.write.csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|default behavior: the same|
|4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.{color:#57d9a3}option("comment", "\u"){color}.csv(path)|#abc
xyz|{color:#4c9aff}#abc{color}
{color:#4c9aff}\udef{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.{color:#57d9a3}option("comment", "#"){color}.csv(path)|\udef
xyz|\udef
xyz|\udef
xyz|the same|
|6|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.csv(path)|#abc
xyz|#abc
\udef
xyz|#abc
\udef
xyz|default behavior: the same|

 

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by the CSV data source was upgraded from 2.8.3 to 2.9.0. 
This also brought in a new univocity-parsers feature that quotes values in the 
first column when they start with the comment character, which was a breaking 
change for downstream users that handle a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0, the content of the output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set the comment option to '\u0000' to keep the previous 
behavior, because of the newly added `isCommentSet` check logic:
{code:java}
val isCommentSet = this.comment != '\u0000'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it 
explicitly in CSV dataSource.

 

After this change, the behavior is as follows:
|id|code|2.4 and before|3.0 and after|this update|remark|
|1|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "\u").csv(path)|#abc
*def*
xyz|{color:#4c9aff}"#abc"{color}
{color:#4c9aff}*def*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}*"def"*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|2|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "#").csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|3|Seq("#abc", "\udef", "xyz").toDF()
.write.csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "\u").csv(path)|#abc
xyz|{color:#4c9aff}#abc{color}
{color:#4c9aff}\udef{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", 

[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicitly in CSV dataSource

2023-02-06 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by the CSV data source was upgraded from 2.8.3 to 2.9.0. 
This also brought in a new univocity-parsers feature that quotes values in the 
first column when they start with the comment character, which was a breaking 
change for downstream users that handle a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0, the content of the output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set the comment option to '\u0000' to keep the previous 
behavior, because of the newly added `isCommentSet` check logic:
{code:java}
val isCommentSet = this.comment != '\u0000'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it 
explicitly in CSV dataSource.

 
|id|code|2.4 and before|3.0 and after|this update|remark|
|1|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "\u").csv(path)|#abc
*def*
xyz|{color:#4c9aff}"#abc"{color}
{color:#4c9aff}*def*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}*"def"*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|2|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "#").csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|3|Seq("#abc", "\udef", "xyz").toDF()
.write.csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "\u").csv(path)|#abc
xyz|{color:#4c9aff}#abc{color}
{color:#4c9aff}\udef{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "#").csv(path)|\udef
xyz|\udef
xyz|\udef
xyz|the same|
|6|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.csv(path)|#abc
xyz|#abc
\udef
xyz|#abc
\udef
xyz|the same|

 

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by the CSV data source was upgraded from 2.8.3 to 2.9.0. 
This also brought in a new univocity-parsers feature that quotes values in the 
first column when they start with the comment character, which was a breaking 
change for downstream users that handle a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0, the content of the output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set the comment option to '\u0000' to keep the previous 
behavior, because of the newly added `isCommentSet` check logic:
{code:java}
val isCommentSet = this.comment != '\u0000'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it 
explicitly in CSV dataSource.

 
|id|code| |2.4 and before|3.0 and after|current update|remark|
|1|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "\u").csv(path)| |#abc
*def*
xyz|{color:#4c9aff}"#abc"{color}
{color:#4c9aff}*def*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}*"def"*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}a little bit difference{color}|
|2|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "#").csv(path)| |#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|3|Seq("#abc", "\udef", "xyz").toDF()
.write.csv(path)| |#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "\u").csv(path)| |#abc
xyz|{color:#4c9aff}#abc{color}
{color:#4c9aff}\udef{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}a little bit difference{color}|
|5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "#").csv(path)| |\udef
xyz|\udef
xyz|\udef
xyz|the same|
|6|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.csv(path)| |#abc
xyz|#abc
\udef
xyz|#abc
\udef
xyz|the same|
 


> Pass the comment option 

[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicitly in CSV dataSource

2023-02-06 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior as 
before because the new added `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it 
explicitly in CSV dataSource.

 

After this change, the behavior is as follows:
|id|code|2.4 and before|3.0 and after|this update|remark|
|1|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "\u").csv(path)|#abc
*def*
xyz|{color:#4c9aff}"#abc"{color}
{color:#4c9aff}*def*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}*"def"*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|2|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "#").csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|3|Seq("#abc", "\udef", "xyz").toDF()
.write.csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "\u").csv(path)|#abc
xyz|{color:#4c9aff}#abc{color}
{color:#4c9aff}\udef{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "#").csv(path)|\udef
xyz|\udef
xyz|\udef
xyz|the same|
|6|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.csv(path)|#abc
xyz|#abc
\udef
xyz|#abc
\udef
xyz|the same|

 

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior as 
before because the new added `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it 
explicitly in CSV dataSource.

 
|id|code|2.4 and before|3.0 and after|this update|remark|
|1|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "\u").csv(path)|#abc
*def*
xyz|{color:#4c9aff}"#abc"{color}
{color:#4c9aff}*def*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}*"def"*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|2|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "#").csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|3|Seq("#abc", "\udef", "xyz").toDF()
.write.csv(path)|#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "\u").csv(path)|#abc
xyz|{color:#4c9aff}#abc{color}
{color:#4c9aff}\udef{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}this update has a little bit 
difference with 3.0{color}|
|5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "#").csv(path)|\udef
xyz|\udef
xyz|\udef
xyz|the same|
|6|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.csv(path)|#abc
xyz|#abc

[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicitly in CSV dataSource

2023-02-06 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior as 
before because the new added `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it 
explicitly in CSV dataSource.

 
|id|code| |2.4 and before|3.0 and after|current update|remark|
|1|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "\u").csv(path)| |#abc
*def*
xyz|{color:#4c9aff}"#abc"{color}
{color:#4c9aff}*def*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}*"def"*{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}a little bit difference{color}|
|2|Seq("#abc", "\udef", "xyz").toDF()
.write.option("comment", "#").csv(path)| |#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|3|Seq("#abc", "\udef", "xyz").toDF()
.write.csv(path)| |#abc
*def*
xyz|"#abc"
*def*
xyz|"#abc"
*def*
xyz|the same|
|4|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "\u").csv(path)| |#abc
xyz|{color:#4c9aff}#abc{color}
{color:#4c9aff}\udef{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}#abc{color}
{color:#4c9aff}xyz{color}|{color:#4c9aff}a little bit difference{color}|
|5|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.option("comment", "#").csv(path)| |\udef
xyz|\udef
xyz|\udef
xyz|the same|
|6|{_}Seq{_}("#abc", "\udef", "xyz").toDF().write.text(path)
spark.read.csv(path)| |#abc
xyz|#abc
\udef
xyz|#abc
\udef
xyz|the same|
 

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior as 
before because the new added `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it 
explicitly in CSV dataSource.


> Pass the comment option through to univocity if users set it explicitly in 
> CSV dataSource
> -
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> For users, they can't set comment option 

[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicitly in CSV dataSource

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Summary: Pass the comment option through to univocity if users set it 
explicitly in CSV dataSource  (was: Pass the comment option through to 
univocity if users set it explicity in CSV dataSource)

> Pass the comment option through to univocity if users set it explicitly in 
> CSV dataSource
> -
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> For users, they can't set comment option to '\u'  to keep the behavior as 
> before because the new added `isCommentSet` check logic as follows:
> {code:java}
> val isCommentSet = this.comment != '\u'
> def asWriterSettings: CsvWriterSettings = {
>   // other code
>   if (isCommentSet) {
> format.setComment(comment)
>   }
>   // other code
> }
>  {code}
> It's better to pass the comment option through to univocity if users set it 
> explicitly in CSV dataSource.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicity in CSV dataSource

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior as 
before because the new added `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to univocity if users set it 
explicitly in CSV dataSource.

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior as 
before because the new added `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to Univocity if users set it 
explicitly in CSV dataSource.


> Pass the comment option through to univocity if users set it explicity in CSV 
> dataSource
> 
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> For users, they can't set comment option to '\u'  to keep the behavior as 
> before because the new added `isCommentSet` check logic as follows:
> {code:java}
> val isCommentSet = this.comment != '\u'
> def asWriterSettings: CsvWriterSettings = {
>   // other code
>   if (isCommentSet) {
> format.setComment(comment)
>   }
>   // other code
> }
>  {code}
> It's better to pass the comment option through to univocity if users set it 
> explicitly in CSV dataSource.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicity in CSV dataSource

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior as 
before because the new added `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
It's better to pass the comment option through to Univocity if users set it 
explicitly in CSV dataSource.

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior 
before because the `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
  until univocity-parses releases a new version.


> Pass the comment option through to univocity if users set it explicity in CSV 
> dataSource
> 
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> For users, they can't set comment option to '\u'  to keep the behavior as 
> before because the new added `isCommentSet` check logic as follows:
> {code:java}
> val isCommentSet = this.comment != '\u'
> def asWriterSettings: CsvWriterSettings = {
>   // other code
>   if (isCommentSet) {
> format.setComment(comment)
>   }
>   // other code
> }
>  {code}
> It's better to pass the comment option through to Univocity if users set it 
> explicitly in CSV dataSource.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicity in CSV dataSource

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior 
before because the `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  // other code
  if (isCommentSet) {
format.setComment(comment)
  }
  // other code
}
 {code}
  until univocity-parses releases a new version.

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior 
before because the `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  xx
  if (isCommentSet) {
format.setComment(comment)
  }
}
 {code}
  until univocity-parses releases a new version.


> Pass the comment option through to univocity if users set it explicity in CSV 
> dataSource
> 
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> For users, they can't set comment option to '\u'  to keep the behavior 
> before because the `isCommentSet` check logic as follows:
> {code:java}
> val isCommentSet = this.comment != '\u'
> def asWriterSettings: CsvWriterSettings = {
>   // other code
>   if (isCommentSet) {
> format.setComment(comment)
>   }
>   // other code
> }
>  {code}
>   until univocity-parses releases a new version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicity in CSV dataSource

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For users, they can't set comment option to '\u'  to keep the behavior 
before because the `isCommentSet` check logic as follows:
{code:java}
val isCommentSet = this.comment != '\u'


def asWriterSettings: CsvWriterSettings = {
  xx
  if (isCommentSet) {
format.setComment(comment)
  }
}
 {code}
  until univocity-parses releases a new version.

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For Spark, it's better to add a legacy config to restores the legacy behavior 
until univocity-parses releases a new version.


> Pass the comment option through to univocity if users set it explicity in CSV 
> dataSource
> 
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> For users, they can't set comment option to '\u'  to keep the behavior 
> before because the `isCommentSet` check logic as follows:
> {code:java}
> val isCommentSet = this.comment != '\u'
> def asWriterSettings: CsvWriterSettings = {
>   xx
>   if (isCommentSet) {
> format.setComment(comment)
>   }
> }
>  {code}
>   until univocity-parses releases a new version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicity in CSV dataSource

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!

For Spark, it's better to add a legacy config to restores the legacy behavior 
until univocity-parses releases a new version.

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!
I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] for 
issue 
[univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505]
 to univocity-parses, but it seems to be a long time for waiting it to be 
merged.
 
For Spark, it's better to add a legacy config to restores the legacy behavior 
until univocity-parses releases a new version.


> Pass the comment option through to univocity if users set it explicity in CSV 
> dataSource
> 
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> For Spark, it's better to add a legacy config to restores the legacy behavior 
> until univocity-parses releases a new version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Pass the comment option through to univocity if users set it explicity in CSV dataSource

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Summary: Pass the comment option through to univocity if users set it 
explicity in CSV dataSource  (was: Add a legacy config for restoring writer's 
comment option behavior in CSV dataSource)

> Pass the comment option through to univocity if users set it explicity in CSV 
> dataSource
> 
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] 
> for issue 
> [univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505]
>  to univocity-parses, but it seems to be a long time for waiting it to be 
> merged.
>  
> For Spark, it's better to add a legacy config to restores the legacy behavior 
> until univocity-parses releases a new version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42252) Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42252:

Fix Version/s: 3.5.0
   (was: 3.4.0)

> Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config
> --
>
> Key: SPARK-42252
> URL: https://issues.apache.org/jira/browse/SPARK-42252
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.5.0
>
>
> After Jira SPARK-28209 and PR 
> [25007|[https://github.com/apache/spark/pull/25007]], the new shuffle writer 
> api is proposed. All shuffle writers(BypassMergeSortShuffleWriter, 
> SortShuffleWriter, UnsafeShuffleWriter) are based on 
> LocalDiskShuffleMapOutputWriter to write local disk shuffle files. The config 
> spark.shuffle.unsafe.file.output.buffer used in 
> LocalDiskShuffleMapOutputWriter was only used in UnsafeShuffleWriter before. 
>  
> It's better to rename it and make it more suitable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42252) Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42252:

Affects Version/s: 3.3.0
   3.2.0
   3.1.0
   3.4.0

> Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config
> --
>
> Key: SPARK-42252
> URL: https://issues.apache.org/jira/browse/SPARK-42252
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.4.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.5.0
>
>
> After Jira SPARK-28209 and PR 
> [25007|[https://github.com/apache/spark/pull/25007]], the new shuffle writer 
> api is proposed. All shuffle writers(BypassMergeSortShuffleWriter, 
> SortShuffleWriter, UnsafeShuffleWriter) are based on 
> LocalDiskShuffleMapOutputWriter to write local disk shuffle files. The config 
> spark.shuffle.unsafe.file.output.buffer used in 
> LocalDiskShuffleMapOutputWriter was only used in UnsafeShuffleWriter before. 
>  
> It's better to rename it and make it more suitable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42252) Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config

2023-02-05 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42252:

Target Version/s: 3.5.0  (was: 3.4.0)

> Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config
> --
>
> Key: SPARK-42252
> URL: https://issues.apache.org/jira/browse/SPARK-42252
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
>
> After Jira SPARK-28209 and PR 
> [25007|[https://github.com/apache/spark/pull/25007]], the new shuffle writer 
> api is proposed. All shuffle writers(BypassMergeSortShuffleWriter, 
> SortShuffleWriter, UnsafeShuffleWriter) are based on 
> LocalDiskShuffleMapOutputWriter to write local disk shuffle files. The config 
> spark.shuffle.unsafe.file.output.buffer used in 
> LocalDiskShuffleMapOutputWriter was only used in UnsafeShuffleWriter before. 
>  
> It's better to rename it and make it more suitable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Add a legacy config for restoring writer's comment option behavior in CSV dataSource

2023-02-03 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Attachment: image-2023-02-03-18-56-10-083.png

> Add a legacy config for restoring writer's comment option behavior in CSV 
> dataSource
> 
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-44-30-296.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-15-12-661.png!
> I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] 
> for issue 
> [univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505]
>  to univocity-parses, but it seems to be a long time for waiting it to be 
> merged.
>  
> For Spark, it's better to add a legacy config to restores the legacy behavior 
> until univocity-parses releases a new version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Add a legacy config for restoring writer's comment option behavior in CSV dataSource

2023-02-03 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Description: 
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-56-01-596.png!
After this change, the content is shown as:
!image-2023-02-03-18-56-10-083.png!
I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] for 
issue 
[univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505]
 to univocity-parses, but it seems to be a long time for waiting it to be 
merged.
 
For Spark, it's better to add a legacy config to restores the legacy behavior 
until univocity-parses releases a new version.

  was:
In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-44-30-296.png!
After this change, the content is shown as:
!image-2023-02-03-18-15-12-661.png!
I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] for 
issue 
[univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505]
 to univocity-parses, but it seems to be a long time for waiting it to be 
merged.
 
For Spark, it's better to add a legacy config to restores the legacy behavior 
until univocity-parses releases a new version.


> Add a legacy config for restoring writer's comment option behavior in CSV 
> dataSource
> 
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-56-01-596.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-56-10-083.png!
> I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] 
> for issue 
> [univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505]
>  to univocity-parses, but it seems to be a long time for waiting it to be 
> merged.
>  
> For Spark, it's better to add a legacy config to restores the legacy behavior 
> until univocity-parses releases a new version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42335) Add a legacy config for restoring writer's comment option behavior in CSV dataSource

2023-02-03 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42335:

Attachment: image-2023-02-03-18-56-01-596.png

> Add a legacy config for restoring writer's comment option behavior in CSV 
> dataSource
> 
>
> Key: SPARK-42335
> URL: https://issues.apache.org/jira/browse/SPARK-42335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-02-03-18-56-01-596.png, 
> image-2023-02-03-18-56-10-083.png
>
>
> In PR [https://github.com/apache/spark/pull/29516], in order to fix some 
> bugs, univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 
> 2.9.0, it also involved a new feature of univocity-parsers that quoting 
> values of the first column that start with the comment character. It made a 
> breaking for users downstream that handing a whole row as input.
>  
> For codes:
> {code:java}
> Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
> Before Spark 3.0,the content of output CSV files is shown as:
> !image-2023-02-03-18-44-30-296.png!
> After this change, the content is shown as:
> !image-2023-02-03-18-15-12-661.png!
> I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] 
> for issue 
> [univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505]
>  to univocity-parses, but it seems to be a long time for waiting it to be 
> merged.
>  
> For Spark, it's better to add a legacy config to restores the legacy behavior 
> until univocity-parses releases a new version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42335) Add a legacy config for restoring writer's comment option behavior in CSV dataSource

2023-02-03 Thread Wei Guo (Jira)
Wei Guo created SPARK-42335:
---

 Summary: Add a legacy config for restoring writer's comment option 
behavior in CSV dataSource
 Key: SPARK-42335
 URL: https://issues.apache.org/jira/browse/SPARK-42335
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0, 3.2.0, 3.1.0, 3.0.0
Reporter: Wei Guo
 Fix For: 3.4.0


In PR [https://github.com/apache/spark/pull/29516], in order to fix some bugs, 
univocity-parsers used by CSV dataSource was upgraded from 2.8.3 to 2.9.0, it 
also involved a new feature of univocity-parsers that quoting values of the 
first column that start with the comment character. It made a breaking for 
users downstream that handing a whole row as input.
 
For codes:
{code:java}
Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test") {code}
Before Spark 3.0,the content of output CSV files is shown as:
!image-2023-02-03-18-44-30-296.png!
After this change, the content is shown as:
!image-2023-02-03-18-15-12-661.png!
I have made a PR [https://github.com/uniVocity/univocity-parsers/pull/518] for 
issue 
[univocity-parsers#505|https://github.com/uniVocity/univocity-parsers/issues/505]
 to univocity-parses, but it seems to be a long time for waiting it to be 
merged.
 
For Spark, it's better to add a legacy config to restores the legacy behavior 
until univocity-parses releases a new version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42237) change binary to unsupported dataType in csv format

2023-02-02 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42237:

Description: 
When a binary column is written into CSV files, the actual content of this column 
is {*}object.toString(){*}, which is meaningless.
{code:java}
val df = Seq(Array[Byte](1,2)).toDF
df.write.csv("/Users/guowei/Desktop/binary_csv")
{code}
The csv file's content is as follows:

!image-2023-01-30-17-21-09-212.png|width=141,height=29!

Meanwhile, if a binary column is saved as a table with the CSV file format, the table 
can't be read back successfully.
{code:java}
val df = Seq((1, Array[Byte](1,2))).toDF
df.write.format("csv").saveAsTable("binaryDataTable")
spark.sql("select * from binaryDataTable").show()
{code}
!https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A!

So I think it's better to change binary to an unsupported dataType in the CSV format, 
both for datasource v1 (CSVFileFormat) and v2 (CSVTable).
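
Not part of this proposal, but as a hedged workaround sketch (hypothetical path, assuming a spark-shell style `spark` session): users who truly need binary data in CSV can encode it explicitly, e.g. as base64, so the content round-trips as text:
{code:java}
// Workaround sketch: encode the binary column as base64 text before writing,
// instead of relying on the meaningless object.toString() output.
import spark.implicits._
import org.apache.spark.sql.functions.{base64, col, unbase64}

val df = Seq((1, Array[Byte](1, 2))).toDF("id", "payload")
df.withColumn("payload", base64(col("payload")))
  .write.csv("/tmp/binary_as_base64_csv")

// Read it back and decode the text column to binary again.
val restored = spark.read.csv("/tmp/binary_as_base64_csv")
  .withColumn("_c1", unbase64(col("_c1")))
{code}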

  was:
When a binary colunm is written into csv files, actual content of this colunm 
is {*}object.toString(){*}, which is meaningless.
{code:java}
val df = Seq(Array[Byte](1,2)).toDF
df.write.csv("/Users/guowei19/Desktop/binary_csv")
{code}

The csv file's content is as follows:

!image-2023-01-30-17-21-09-212.png|width=141,height=29!

Meanwhile, if a binary colunm saved as table with csv fileformat, the table 
can't be read back successfully.

{code:java}
val df = Seq((1, Array[Byte](1,2))).toDF
df.write.format("csv").saveAsTable("binaryDataTable")spark.sql("select * from 
binaryDataTable").show()
{code}

!https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A!

So I think it' better to change binary to unsupported dataType in csv format, 
both for datasource v1(CSVFileFormat) and v2(CSVTable).


> change binary to unsupported dataType in csv format
> ---
>
> Key: SPARK-42237
> URL: https://issues.apache.org/jira/browse/SPARK-42237
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.8, 3.3.1
>Reporter: Wei Guo
>Priority: Minor
> Attachments: image-2023-01-30-17-21-09-212.png
>
>
> When a binary colunm is written into csv files, actual content of this colunm 
> is {*}object.toString(){*}, which is meaningless.
> {code:java}
> val df = Seq(Array[Byte](1,2)).toDF
> df.write.csv("/Users/guowei/Desktop/binary_csv")
> {code}
> The csv file's content is as follows:
> !image-2023-01-30-17-21-09-212.png|width=141,height=29!
> Meanwhile, if a binary colunm saved as table with csv fileformat, the table 
> can't be read back successfully.
> {code:java}
> val df = Seq((1, Array[Byte](1,2))).toDF
> df.write.format("csv").saveAsTable("binaryDataTable")spark.sql("select * from 
> binaryDataTable").show()
> {code}
> !https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A!
> So I think it' better to change binary to unsupported dataType in csv format, 
> both for datasource v1(CSVFileFormat) and v2(CSVTable).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42252) Deprecate spark.shuffle.unsafe.file.output.buffer and add a new config

2023-01-30 Thread Wei Guo (Jira)
Wei Guo created SPARK-42252:
---

 Summary: Deprecate spark.shuffle.unsafe.file.output.buffer and add 
a new config
 Key: SPARK-42252
 URL: https://issues.apache.org/jira/browse/SPARK-42252
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Wei Guo
 Fix For: 3.4.0


After Jira SPARK-28209 and PR [25007|https://github.com/apache/spark/pull/25007], a 
new shuffle writer API was proposed. All shuffle writers (BypassMergeSortShuffleWriter, 
SortShuffleWriter, UnsafeShuffleWriter) are based on LocalDiskShuffleMapOutputWriter 
to write local disk shuffle files. The config spark.shuffle.unsafe.file.output.buffer, 
now used in LocalDiskShuffleMapOutputWriter, was previously only used in 
UnsafeShuffleWriter.
 
It would be better to rename it to something more suitable.
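
For reference, a small illustration (not from this ticket, and no new config name is guessed here) of how the existing config is tuned today; its value is a byte-size string and 32k is the current default:
{code:java}
// Hedged illustration: setting the existing (to-be-deprecated) config when
// building a SparkSession. Any renamed config would replace the key below.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-output-buffer-demo")
  .config("spark.shuffle.unsafe.file.output.buffer", "64k")
  .getOrCreate()
{code}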



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42237) change binary to unsupported dataType in csv format

2023-01-30 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17681978#comment-17681978
 ] 

Wei Guo commented on SPARK-42237:
-

a pr is ready~

> change binary to unsupported dataType in csv format
> ---
>
> Key: SPARK-42237
> URL: https://issues.apache.org/jira/browse/SPARK-42237
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.8, 3.3.1
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-01-30-17-21-09-212.png
>
>
> When a binary colunm is written into csv files, actual content of this colunm 
> is {*}object.toString(){*}, which is meaningless.
> {code:java}
> val df = 
> Seq(Array[Byte](1,2)).toDFdf.write.csv("/Users/guowei19/Desktop/binary_csv") 
> {code}
> The csv file's content is as follows:
> !image-2023-01-30-17-21-09-212.png|width=141,height=29!
> Meanwhile, if a binary colunm saved as table with csv fileformat, the table 
> can't be read back successfully.
> {code:java}
> val df = Seq((1, 
> Array[Byte](1,2))).toDFdf.write.format("csv").saveAsTable("binaryDataTable")spark.sql("select
>  * from binaryDataTable").show() {code}
> !https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A!
> So I think it' better to change binary to unsupported dataType in csv format, 
> both for datasource v1(CSVFileFormat) and v2(CSVTable).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-42237) change binary to unsupported dataType in csv format

2023-01-30 Thread Wei Guo (Jira)


[ https://issues.apache.org/jira/browse/SPARK-42237 ]


Wei Guo deleted comment on SPARK-42237:
-

was (Author: wayne guo):
a pr is ready~

> change binary to unsupported dataType in csv format
> ---
>
> Key: SPARK-42237
> URL: https://issues.apache.org/jira/browse/SPARK-42237
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.8, 3.3.1
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-01-30-17-21-09-212.png
>
>
> When a binary colunm is written into csv files, actual content of this colunm 
> is {*}object.toString(){*}, which is meaningless.
> {code:java}
> val df = 
> Seq(Array[Byte](1,2)).toDFdf.write.csv("/Users/guowei19/Desktop/binary_csv") 
> {code}
> The csv file's content is as follows:
> !image-2023-01-30-17-21-09-212.png|width=141,height=29!
> Meanwhile, if a binary colunm saved as table with csv fileformat, the table 
> can't be read back successfully.
> {code:java}
> val df = Seq((1, 
> Array[Byte](1,2))).toDFdf.write.format("csv").saveAsTable("binaryDataTable")spark.sql("select
>  * from binaryDataTable").show() {code}
> !https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A!
> So I think it' better to change binary to unsupported dataType in csv format, 
> both for datasource v1(CSVFileFormat) and v2(CSVTable).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42237) change binary to unsupported dataType in csv format

2023-01-30 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42237:

Description: 
When a binary column is written into csv files, the actual content of this column 
is {*}object.toString(){*}, which is meaningless.
{code:java}
val df = Seq(Array[Byte](1,2)).toDF
df.write.csv("/Users/guowei19/Desktop/binary_csv")
{code}
The csv file's content is as follows:
!image-2023-01-30-17-21-09-212.png|width=141,height=29!
Meanwhile, if a binary column is saved as a table with the csv fileformat, the table 
can't be read back successfully.
{code:java}
val df = Seq((1, Array[Byte](1,2))).toDF
df.write.format("csv").saveAsTable("binaryDataTable")
spark.sql("select * from binaryDataTable").show()
{code}
!https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A!
So I think it's better to change binary to an unsupported dataType in csv format, 
both for datasource v1 (CSVFileFormat) and v2 (CSVTable).

  was:
When a binary colunm is written into csv files, actual content of this colunm 
is {*}object.toString(){*}, which is meaningless. 
{code:java}
val df = 
Seq(Array[Byte](1,2)).toDFdf.write.csv("/Users/guowei19/Desktop/binary_csv") 
{code}
The csv file's content is as follows:
!image-2023-01-30-17-18-16-372.png|width=104,height=21!
Meanwhile, if a binary colunm saved as table with csv fileformat, the table 
can't be read back successfully.
{code:java}
val df = Seq((1, 
Array[Byte](1,2))).toDFdf.write.format("csv").saveAsTable("binaryDataTable")spark.sql("select
 * from binaryDataTable").show() {code}
!https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A!
So I think it' better to change binary to unsupported dataType in csv format, 
both for datasource v1(CSVFileFormat) and v2(CSVTable).


> change binary to unsupported dataType in csv format
> ---
>
> Key: SPARK-42237
> URL: https://issues.apache.org/jira/browse/SPARK-42237
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.8, 3.3.1
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-01-30-17-21-09-212.png
>
>
> When a binary colunm is written into csv files, actual content of this colunm 
> is {*}object.toString(){*}, which is meaningless.
> {code:java}
> val df = 
> Seq(Array[Byte](1,2)).toDFdf.write.csv("/Users/guowei19/Desktop/binary_csv") 
> {code}
> The csv file's content is as follows:
> !image-2023-01-30-17-21-09-212.png|width=141,height=29!
> Meanwhile, if a binary colunm saved as table with csv fileformat, the table 
> can't be read back successfully.
> {code:java}
> val df = Seq((1, 
> Array[Byte](1,2))).toDFdf.write.format("csv").saveAsTable("binaryDataTable")spark.sql("select
>  * from binaryDataTable").show() {code}
> !https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A!
> So I think it' better to change binary to unsupported dataType in csv format, 
> both for datasource v1(CSVFileFormat) and v2(CSVTable).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42237) change binary to unsupported dataType in csv format

2023-01-30 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-42237:

Attachment: image-2023-01-30-17-21-09-212.png

> change binary to unsupported dataType in csv format
> ---
>
> Key: SPARK-42237
> URL: https://issues.apache.org/jira/browse/SPARK-42237
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.8, 3.3.1
>Reporter: Wei Guo
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: image-2023-01-30-17-21-09-212.png
>
>
> When a binary colunm is written into csv files, actual content of this colunm 
> is {*}object.toString(){*}, which is meaningless. 
> {code:java}
> val df = 
> Seq(Array[Byte](1,2)).toDFdf.write.csv("/Users/guowei19/Desktop/binary_csv") 
> {code}
> The csv file's content is as follows:
> !image-2023-01-30-17-18-16-372.png|width=104,height=21!
> Meanwhile, if a binary colunm saved as table with csv fileformat, the table 
> can't be read back successfully.
> {code:java}
> val df = Seq((1, 
> Array[Byte](1,2))).toDFdf.write.format("csv").saveAsTable("binaryDataTable")spark.sql("select
>  * from binaryDataTable").show() {code}
> !https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A!
> So I think it' better to change binary to unsupported dataType in csv format, 
> both for datasource v1(CSVFileFormat) and v2(CSVTable).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42237) change binary to unsupported dataType in csv format

2023-01-30 Thread Wei Guo (Jira)
Wei Guo created SPARK-42237:
---

 Summary: change binary to unsupported dataType in csv format
 Key: SPARK-42237
 URL: https://issues.apache.org/jira/browse/SPARK-42237
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.1, 2.4.8
Reporter: Wei Guo
 Fix For: 3.4.0


When a binary column is written into csv files, the actual content of this column 
is {*}object.toString(){*}, which is meaningless. 
{code:java}
val df = Seq(Array[Byte](1,2)).toDF
df.write.csv("/Users/guowei19/Desktop/binary_csv")
{code}
The csv file's content is as follows:
!image-2023-01-30-17-18-16-372.png|width=104,height=21!
Meanwhile, if a binary column is saved as a table with the csv fileformat, the table 
can't be read back successfully.
{code:java}
val df = Seq((1, Array[Byte](1,2))).toDF
df.write.format("csv").saveAsTable("binaryDataTable")
spark.sql("select * from binaryDataTable").show()
{code}
!https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963=Eiscz4oMI45Sfp=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A!
So I think it's better to change binary to an unsupported dataType in csv format, 
both for datasource v1 (CSVFileFormat) and v2 (CSVTable).
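
Until binary is rejected outright, a possible workaround is to encode the binary column explicitly before writing, so the csv content is well defined. This is a sketch only: the path is illustrative and spark.implicits._ is assumed to be in scope, as in the snippets above.
{code:scala}
// Sketch: encode the binary column as base64 text on write and decode it on read.
import org.apache.spark.sql.functions.{base64, col, unbase64}

val df = Seq(Array[Byte](1, 2)).toDF("data")
df.select(base64(col("data")).as("data_b64"))
  .write.csv("/tmp/binary_base64_csv")

// Without a header, the single column is read back as "_c0".
val restored = spark.read.csv("/tmp/binary_base64_csv")
  .select(unbase64(col("_c0")).as("data"))
{code}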



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39901) Reconsider design of ignoreCorruptFiles feature

2022-07-28 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572502#comment-17572502
 ] 

Wei Guo commented on SPARK-39901:
-

Both `ignoreCorruptFiles` features need to be covered: the SQL one 
(spark.sql.files.ignoreCorruptFiles) and the RDD one (spark.files.ignoreCorruptFiles). 
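
For reference, a minimal sketch of how the two existing flags are enabled today (the values are illustrative); a redesign would need to keep both code paths in mind.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ignore-corrupt-files-demo")
  // file-based SQL / DataFrame sources
  .config("spark.sql.files.ignoreCorruptFiles", "true")
  // core / RDD inputs such as sc.textFile
  .config("spark.files.ignoreCorruptFiles", "true")
  .getOrCreate()
{code}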

> Reconsider design of ignoreCorruptFiles feature
> ---
>
> Key: SPARK-39901
> URL: https://issues.apache.org/jira/browse/SPARK-39901
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Priority: Major
>
> I'm filing this ticket as a followup to the discussion at 
> [https://github.com/apache/spark/pull/36775#issuecomment-1148136217] 
> regarding the `ignoreCorruptFiles` feature: the current implementation is 
> biased towards considering a broad range of IOExceptions to be corruption, but 
> this is likely overly-broad and might mis-identify transient errors as 
> corruption (causing non-corrupt data to be erroneously discarded).
> SPARK-39389 fixes one instance of that problem, but we are still vulnerable 
> to similar issues because of the overall design of this feature.
> I think we should reconsider the design of this feature: maybe we should 
> switch the default behavior so that only an explicit allowlist of known 
> corruption exceptions can cause files to be skipped. This could be done 
> through involvement of other parts of the code, e.g. rewrapping exceptions 
> into a `CorruptFileException` so higher layers can positively identify 
> corruption.
> Any changes to behavior here could potentially impact users jobs, so we'd 
> need to think carefully about when we want to change (in a 3.x release? 4.x?) 
> and how we want to provide escape hatches (e.g. configs to revert back to old 
> behavior). 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37575) null values should be saved as nothing rather than quoted empty Strings "" with default settings

2022-01-06 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37575:

Description: 
As mentioned in sql migration 
guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
{noformat}
Since Spark 2.4, empty strings are saved as quoted empty strings "". In version 
2.3 and earlier, empty strings are equal to null values and do not reflect to 
any characters in saved CSV files. For example, the row of "a", null, "", 1 was 
written as a,,,1. Since Spark 2.4, the same row is saved as a,,"",1. To restore 
the previous behavior, set the CSV option emptyValue to empty (not quoted) 
string.{noformat}
But actually, both empty strings and null values are saved as quoted empty 
strings "" rather than "" (for empty strings) and nothing (for null values).

code:
{code:java}
val data = List("spark", null, "").toDF("name")
data.coalesce(1).write.csv("spark_csv_test")
{code}
 actual result:
{noformat}
line1: spark
line2: ""
line3: ""{noformat}
expected result:
{noformat}
line1: spark
line2: 
line3: ""
{noformat}

  was:
As mentioned in sql migration 
guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
{noformat}
Since Spark 2.4, empty strings are saved as quoted empty strings "". In version 
2.3 and earlier, empty strings are equal to null values and do not reflect to 
any characters in saved CSV files. For example, the row of "a", null, "", 1 was 
written as a,,,1. Since Spark 2.4, the same row is saved as a,,"",1. To restore 
the previous behavior, set the CSV option emptyValue to empty (not quoted) 
string.{noformat}
 

But actually, both empty strings and null values are saved as quoted empty 
Strings "" rather than "" (for empty strings) and nothing(for null values)。

code:
{code:java}
val data = List("spark", null, "").toDF("name")
data.coalesce(1).write.csv("spark_csv_test")
{code}
 actual result:
{noformat}
line1: spark
line2: ""
line3: ""{noformat}
expected result:
{noformat}
line1: spark
line2: 
line3: ""
{noformat}


> null values should be saved as nothing rather than quoted empty Strings "" 
> with default settings
> 
>
> Key: SPARK-37575
> URL: https://issues.apache.org/jira/browse/SPARK-37575
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Assignee: Wei Guo
>Priority: Major
> Fix For: 3.3.0
>
>
> As mentioned in sql migration 
> guide([https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24]),
> {noformat}
> Since Spark 2.4, empty strings are saved as quoted empty strings "". In 
> version 2.3 and earlier, empty strings are equal to null values and do not 
> reflect to any characters in saved CSV files. For example, the row of "a", 
> null, "", 1 was written as a,,,1. Since Spark 2.4, the same row is saved as 
> a,,"",1. To restore the previous behavior, set the CSV option emptyValue to 
> empty (not quoted) string.{noformat}
> But actually, both empty strings and null values are saved as quoted empty 
> strings "" rather than "" (for empty strings) and nothing (for null values).
> code:
> {code:java}
> val data = List("spark", null, "").toDF("name")
> data.coalesce(1).write.csv("spark_csv_test")
> {code}
>  actual result:
> {noformat}
> line1: spark
> line2: ""
> line3: ""{noformat}
> expected result:
> {noformat}
> line1: spark
> line2: 
> line3: ""
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-17 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo resolved SPARK-37604.
-
Resolution: Not A Problem

> Change emptyValueInRead's effect to that any fields matching this string will 
> be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
> Attachments: empty_test.png
>
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in the dependent component univocity's 
> {*}_CommonSettings_{*} is designed as follows:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, empty (nothing) content will 
> be converted to nullValue strings. But in Spark, we finally convert empty 
> content or nullValue strings to null in the *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*}, we add an emptyValueInRead option for 
> reading and an emptyValueInWrite option for writing. I found that Spark keeps 
> the same behavior for emptyValue as univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the emptyValue is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>  
> There is an obvious difference between nullValue and emptyValue in read 
> handling. For nullValue, we will convert nothing or nullValue strings to null 
> in dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty 
> strings) to emptyValue strings rather than to convert both "\"\""(quoted 
> empty strings) and emptyValue strings to ""(empty) in dataframe.
> I think it's better that if we 

[jira] [Commented] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-17 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461525#comment-17461525
 ] 

Wei Guo commented on SPARK-37604:
-

Well, I think your explanation is clear and reasonable, and it convinced me. 
So I'll close this issue and the related PR. Thank you! 

> Change emptyValueInRead's effect to that any fields matching this string will 
> be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
> Attachments: empty_test.png
>
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the empty is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>  
> There is an obvious difference between nullValue and emptyValue in read 
> handling. For nullValue, we will convert nothing or nullValue strings to null 
> in dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty 
> strings) to emptyValue 

[jira] [Commented] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-17 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461393#comment-17461393
 ] 

Wei Guo commented on SPARK-37604:
-

Following Hyukjin Kwon's consideration in the related PR, if we worry about 
making a breaking change, we can add a new option to support it.

> Change emptyValueInRead's effect to that any fields matching this string will 
> be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
> Attachments: empty_test.png
>
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the empty is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>  
> There is an obvious difference between nullValue and emptyValue in read 
> handling. For nullValue, we will convert nothing or nullValue strings to null 
> in dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty 
> strings) to emptyValue 

[jira] [Commented] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-17 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461390#comment-17461390
 ] 

Wei Guo commented on SPARK-37604:
-

In short, for null values, we can save null values in a dataframe as "NULL" 
strings in csv files and read "NULL" strings back as null values with the same 
nullValue option ("NULL"). But for empty values, if we save empty values in a 
dataframe as "EMPTY" strings in csv files, we cannot read "EMPTY" strings back 
as empty values with the same emptyValue option ("EMPTY"); we finally get "EMPTY" 
strings. [~maxgekk] 
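
A minimal round-trip sketch of this asymmetry (the local path is illustrative and spark.implicits._ is assumed to be in scope, as in a spark-shell session):
{code:scala}
// Write one null and one empty string with explicit nullValue/emptyValue options,
// then read the file back with the same options.
val path = "/tmp/empty_vs_null_csv"

Seq(("Tesla", null: String), ("Ford", "")).toDF("make", "comment")
  .write.mode("overwrite")
  .option("nullValue", "NULL")
  .option("emptyValue", "EMPTY")
  .csv(path)

// The null comes back as null, but the empty string comes back as the
// literal string "EMPTY" rather than "".
spark.read
  .option("nullValue", "NULL")
  .option("emptyValue", "EMPTY")
  .csv(path)
  .show()
{code}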

> Change emptyValueInRead's effect to that any fields matching this string will 
> be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
> Attachments: empty_test.png
>
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the empty is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> 

[jira] [Updated] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Issue Type: Improvement  (was: Bug)

> Change emptyValueInRead's effect to that any fields matching this string will 
> be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
> Attachments: empty_test.png
>
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the empty is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>  
> There is an obvious difference between nullValue and emptyValue in read 
> handling. For nullValue, we will convert nothing or nullValue strings to null 
> in dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty 
> strings) to emptyValue strings rather than to convert both "\"\""(quoted 
> empty strings) and emptyValue strings to ""(empty) in dataframe.
> I think it's better 

[jira] [Comment Edited] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460144#comment-17460144
 ] 

Wei Guo edited comment on SPARK-37604 at 12/15/21, 6:05 PM:


For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!empty_test.png|width=701,height=286!

For null values, we can write and read back with the same nullValue option, but 
for empty strings, even with the same emptyValue option, it's irreversible. 

FYI. [~maxgekk] 


was (Author: wayne guo):
For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* is "" rather a "EMPTY" string.

!empty_test.png!

For null values, we can write and read back with the same nullValue option, but 
for empty strings, even with same emptyValue option, it's irreversible. 

FYI. [~maxgekk] 

> Change emptyValueInRead's effect to that any fields matching this string will 
> be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
> Attachments: empty_test.png
>
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with 

[jira] [Comment Edited] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460144#comment-17460144
 ] 

Wei Guo edited comment on SPARK-37604 at 12/15/21, 6:05 PM:


For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* is "" rather a "EMPTY" string.

!empty_test.png!

For null values, we can write and read back with the same nullValue option, but 
for empty strings, even with same emptyValue option, it's irreversible. 

FYI. [~maxgekk] 


was (Author: wayne guo):
For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* is "" rather a "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!

For null values, we can write and read back with the same nullValue option, but 
for empty strings, even with same emptyValue option, it's irreversible. 

FYI. [~maxgekk] 

> Change emptyValueInRead's effect to that any fields matching this string will 
> be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
> Attachments: empty_test.png
>
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for 

[jira] [Updated] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Attachment: (was: image-2021-12-16-01-57-55-864.png)

> Change emptyValueInRead's effect to that any fields matching this string will 
> be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
> Attachments: empty_test.png
>
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the empty is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>  
> There is an obvious difference between nullValue and emptyValue in read 
> handling. For nullValue, we will convert nothing or nullValue strings to null 
> in dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty 
> strings) to emptyValue strings rather than to convert both "\"\""(quoted 
> empty strings) and emptyValue strings to ""(empty) in dataframe.
> I think 

[jira] [Comment Edited] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460144#comment-17460144
 ] 

Wei Guo edited comment on SPARK-37604 at 12/15/21, 6:04 PM:


For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* is "" rather a "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!

For null values, we can write and read back with the same nullValue option, but 
for empty strings, even with same emptyValue option, it's irreversible. 

FYI. [~maxgekk] 


was (Author: wayne guo):
For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!

 

For null values, we can write and read back with the same nullValue option, but for empty strings, even with the same emptyValue option, the round trip is irreversible.

FYI. [~maxgekk] 

> Change emptyValueInRead's effect to that any fields matching this string will 
> be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
> Attachments: empty_test.png, image-2021-12-16-01-57-55-864.png
>
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a 

[jira] [Comment Edited] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460144#comment-17460144
 ] 

Wei Guo edited comment on SPARK-37604 at 12/15/21, 6:03 PM:


For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!

 

For null values, we can write and read back with the same nullValue option, but for empty strings, even with the same emptyValue option, the round trip is irreversible.

FYI. [~maxgekk] 


was (Author: wayne guo):
For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!

 

For null values, we can write and read back with the same nullValue option, but for empty strings, even with the same emptyValue option, the round trip is irreversible

> Change emptyValueInRead's effect to that any fields matching this string will 
> be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
> Attachments: empty_test.png, image-2021-12-16-01-57-55-864.png
>
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for 

[jira] [Comment Edited] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460144#comment-17460144
 ] 

Wei Guo edited comment on SPARK-37604 at 12/15/21, 6:03 PM:


For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!

 

For null values, we can write and read back with the same nullValue option, but for empty strings, even with the same emptyValue option, the round trip is irreversible


was (Author: wayne guo):
For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!

> Change emptyValueInRead's effect to that any fields matching this string will 
> be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
> Attachments: empty_test.png, image-2021-12-16-01-57-55-864.png
>
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the 

[jira] [Commented] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460144#comment-17460144
 ] 

Wei Guo commented on SPARK-37604:
-

For codes:
{code:scala}
val data = Seq(("Tesla", "")).toDF("make", "comment")
data.write.option("emptyValue", "EMPTY").csv("/Users/guowei19/work/test_empty")
{code}
The csv file's content is as:
{noformat}
Tesla,EMPTY
{noformat}
(cat part-0-f0ed9c50-b1bf-4db9-9964-38fbf411e29c-c000.csv)

When I read it back to dataframe:
{code:scala}
spark.read.option("emptyValue", 
"EMPTY").schema(data.schema).csv("/Users/guowei19/work/test_empty").show()
{code}
I want the column *comment* to be "" rather than an "EMPTY" string.

!image-2021-12-16-01-57-55-864.png|width=424,height=173!

> Change emptyValueInRead's effect to that any fields matching this string will 
> be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
> Attachments: empty_test.png, image-2021-12-16-01-57-55-864.png
>
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the empty is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in 

[jira] [Updated] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Attachment: empty_test.png

> Change emptyValueInRead's effect to that any fields matching this string will 
> be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
> Attachments: empty_test.png, image-2021-12-16-01-57-55-864.png
>
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the empty is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>  
> There is an obvious difference between nullValue and emptyValue in read 
> handling. For nullValue, we will convert nothing or nullValue strings to null 
> in dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty 
> strings) to emptyValue strings rather than to convert both "\"\""(quoted 
> empty strings) and emptyValue strings to ""(empty) in dataframe.
> I 

[jira] [Updated] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Attachment: image-2021-12-16-01-57-55-864.png

> Change emptyValueInRead's effect to that any fields matching this string will 
> be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
> Attachments: empty_test.png, image-2021-12-16-01-57-55-864.png
>
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the empty is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>  
> There is an obvious difference between nullValue and emptyValue in read 
> handling. For nullValue, we will convert nothing or nullValue strings to null 
> in dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty 
> strings) to emptyValue strings rather than to convert both "\"\""(quoted 
> empty strings) and emptyValue strings to ""(empty) 

[jira] [Updated] (SPARK-37604) Change emptyValueInRead's effect to that any fields matching this string will be set as "" when reading csv files

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Summary: Change emptyValueInRead's effect to that any fields matching this 
string will be set as "" when reading csv files  (was: The option 
emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields 
matching this string will be set as empty values "" when reading)

> Change emptyValueInRead's effect to that any fields matching this string will 
> be set as "" when reading csv files
> -
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the empty is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>  
> There is an obvious difference between nullValue and emptyValue in read 
> handling. For nullValue, we will convert nothing or nullValue strings to null 
> in dataframe, but for emptyValue, we 

[jira] [Comment Edited] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460091#comment-17460091
 ] 

Wei Guo edited comment on SPARK-37604 at 12/15/21, 4:48 PM:


[~hyukjin.kwon], [~maxgekk] Shall we have a simple discussion about it in your free time? I'd like to hear your thoughts on this.


was (Author: wayne guo):
[~hyukjin.kwon][~maxgekk] Shall we have a simple discussion about it in your free time? I'd like to hear your thoughts on this.

> The option emptyValueInRead(in CSVOptions) is suggested to be designed as 
> that any fields matching this string will be set as empty values "" when 
> reading
> --
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the empty is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>  
> There is an obvious 

[jira] [Commented] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460091#comment-17460091
 ] 

Wei Guo commented on SPARK-37604:
-

[~hyukjin.kwon][~maxgekk] Shall we have a simple discussion about it in your free time? I'd like to hear your thoughts on this.

> The option emptyValueInRead(in CSVOptions) is suggested to be designed as 
> that any fields matching this string will be set as empty values "" when 
> reading
> --
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the empty is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>  
> There is an obvious difference between nullValue and emptyValue in read 
> handling. For nullValue, we will convert nothing or nullValue strings to null 
> in dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty 

[jira] [Commented] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460084#comment-17460084
 ] 

Wei Guo commented on SPARK-37604:
-

Maybe this issue is not a notable bug or improvement, but for common usage, users would prefer to be able to convert these emptyValue strings in csv files back into ""(empty strings) after writing empty strings out as emptyValue strings, rather than keeping the current behavior.

> The option emptyValueInRead(in CSVOptions) is suggested to be designed as 
> that any fields matching this string will be set as empty values "" when 
> reading
> --
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the empty is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>  
> There is an obvious difference between nullValue and emptyValue in read 
> handling. For 

[jira] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


[ https://issues.apache.org/jira/browse/SPARK-37604 ]


Wei Guo deleted comment on SPARK-37604:
-

was (Author: wayne guo):
The current behavior of emptyValueInRead is more like the existing function for null values on Dataset:
{code:scala}
dataframe.na.fill(fillMap){code}

So we could also provide a similar function on Dataset, such as:
{code:scala}
dataframe.empty.fill(fillMap){code}
rather than changing empty strings to emptyValueInRead in the DataFrame when reading csv files.
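
For reference, a minimal sketch of the existing null-side API mentioned above (dataframe.empty.fill is only a hypothetical counterpart and does not exist today), assuming a string column named comment as in the earlier examples:
{code:scala}
// Existing API: after reading, replace nulls in the "comment" column with "".
// A hypothetical empty.fill counterpart would do the same for empty-value markers.
val noNulls = dataframe.na.fill(Map("comment" -> ""))
{code}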

> The option emptyValueInRead(in CSVOptions) is suggested to be designed as 
> that any fields matching this string will be set as empty values "" when 
> reading
> --
>
> Key: SPARK-37604
> URL: https://issues.apache.org/jira/browse/SPARK-37604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.2.0
>Reporter: Wei Guo
>Priority: Major
>
> The csv data format is imported from databricks 
> [spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
> PR [10766|https://github.com/apache/spark/pull/10766] .
> {*}For the nullValue option{*}, according to features described in spark-csv 
> readme file, it's designed as:
> {noformat}
> When reading files:
> nullValue: specifies a string that indicates a null value, any fields 
> matching this string will be set as nulls in the DataFrame
> When writing files:
> nullValue: specifies a string that indicates a null value, nulls in the 
> DataFrame will be written as this string.
> {noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
> "NULL").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,NULL
> {noformat}
> When reading:
> {code:scala}
> spark.read.option("nullValue", "NULL").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|null|
> We can find that null columns in dataframe can be saved as "NULL" strings in 
> csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as 
> null columns*{color} in dataframe. That is:
> {noformat}
> When writing, convert null(in dataframe) to nullValue(in csv)
> When reading, convert nullValue or nothing(in csv) to null(in dataframe)
> {noformat}
> But actually, the option nullValue in depended component univocity's 
> {*}_CommonSettings_{*}, is designed as that:
> {noformat}
> when reading, if the parser does not read any character from the input, the 
> nullValue is used instead of an empty string.
> when writing, if the writer has a null object to write to the output, the 
> nullValue is used instead of an empty string.{noformat}
> {*}There is a difference when reading{*}. In univocity, nothing content will 
> be convert to nullValue strings. But In Spark, we finally convert nothing 
> content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
> method:
> {code:java}
> private def nullSafeDatum(
>  datum: String,
>  name: String,
>  nullable: Boolean,
>  options: CSVOptions)(converter: ValueConverter): Any = {
>   if (datum == options.nullValue || datum == null) {
> if (!nullable) {
>   throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
> }
> null
>   } else {
> converter.apply(datum)
>   }
> } {code}
>  
> From now, we start to talk about emptyValue.
> {*}For the emptyValue option{*},  we add a emptyValueInRead option for 
> reading and a emptyValueInWrite option for writing. I found that Spark keeps 
> the same behaviors for emptyValue with univocity, that is:
> {noformat}
> When reading, if the parser does not read any character from the input, and 
> the input is within quotes, the empty is used instead of an empty string.
> When writing, if the writer has an empty String to write to the output, the 
> emptyValue is used instead of an empty string.{noformat}
> For example, when writing:
> {code:scala}
> Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
> "EMPTY").csv(path){code}
> The saved csv file is shown as:
> {noformat}
> Tesla,EMPTY {noformat}
> When reading:
> {code:scala}
> spark.read.option("emptyValue", "EMPTY").csv(path).show()
> {code}
> The parsed dataframe is shown as:
> ||make||comment||
> |Tesla|EMPTY|
> We can find that empty columns in dataframe can be saved as "EMPTY" strings 
> in csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be 
> parsed as empty columns{color}* in dataframe. That is:
> {noformat}
> When writing, convert "" empty(in dataframe) to emptyValue(in csv)
> When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
> {noformat}
>  
> There is an obvious difference between nullValue and emptyValue 

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:scala}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|null|

We can find that null columns in dataframe can be saved as "NULL" strings in 
csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as null 
columns*{color} in dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the option nullValue in the underlying univocity component's {*}_CommonSettings_{*} is designed as follows:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string.
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string.{noformat}
{*}There is a difference when reading{*}. In univocity, missing content is converted to the nullValue string. But in Spark, we finally convert both missing content and nullValue strings to null in the *_UnivocityParser_ _nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
 

From now on, we start to talk about emptyValue.

{*}For the emptyValue option{*}, we add an emptyValueInRead option for reading and an emptyValueInWrite option for writing. I found that Spark keeps the same behavior for emptyValue as univocity, that is:
{noformat}
When reading, if the parser does not read any character from the input, and the 
input is within quotes, the emptyValue is used instead of an empty string.

When writing, if the writer has an empty String to write to the output, the 
emptyValue is used instead of an empty string.{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
"EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
When reading:
{code:scala}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|EMPTY|

We can find that empty columns in dataframe can be saved as "EMPTY" strings in csv files, *{color:#de350b}but "EMPTY" strings in csv files cannot be parsed back as empty columns{color}* in dataframe. That is:
{noformat}
When writing, convert "" empty(in dataframe) to emptyValue(in csv)
When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
{noformat}
 

There is an obvious difference between nullValue and emptyValue in read handling. For nullValue, we convert both missing content and nullValue strings to null in the dataframe, but for emptyValue, we only try to convert "\"\""(quoted empty strings) to emptyValue strings, rather than converting both "\"\""(quoted empty strings) and emptyValue strings to ""(empty) in the dataframe.

I think it's better to keep a similar behavior for emptyValue as for nullValue 
when reading (i.e. try to recover emptyValue strings in csv files back to ""). 
So I suggest that emptyValueInRead (in CSVOptions) should be designed so that 
any field matching this string is set to the empty value "" when reading.

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: 

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|null|

We can find that null columns in dataframe can be saved as "NULL" strings in 
csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as null 
columns*{color} in dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the option nullValue in depended component univocity's 
{*}_CommonSettings_{*}, is designed as that:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string.
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string.{noformat}
{*}There is a difference when reading{*}. In univocity, nothing content will be 
convert to nullValue strings. But In Spark, we finally convert nothing content 
or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
 

From now, we start to talk about emptyValue.

{*}For the emptyValue option{*},  we add a emptyValueInRead option for reading 
and a emptyValueInWrite option for writing. I found that Spark keeps the same 
behaviors for emptyValue with univocity, that is:
{noformat}
When reading, if the parser does not read any character from the input, and the 
input is within quotes, the empty is used instead of an empty string.

When writing, if the writer has an empty String to write to the output, the 
emptyValue is used instead of an empty string.{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
"EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
When reading:
{code:scala}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|EMPTY|

We can find that empty columns in dataframe can be saved as "EMPTY" strings in 
csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed 
as empty columns{color}* in dataframe. That is:
{noformat}
When writing, convert "" empty(in dataframe) to emptyValue(in csv)
When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
{noformat}
 

There is obvious difference between nullValue and emptyValue in read handling. 
For nullValue, we try to convert nothing or nullValue strings to null in 
dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty 
strings) to emptyValue rather than to convert both "\"\""(quoted empty strings) 
and emptyValue strings to ""(empty) in dataframe.

I think it's better that we keep the similar behavior(try to recover emptyValue 
to "") for emptyValue as nullValue when reading, so I suggest that the 
emptyValueInRead(in CSVOptions) should  be designed as that any fields matching 
this string will be set as empty values "" when reading.

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that 

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|null|

We can find that null columns in dataframe can be saved as "NULL" strings in 
csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as null 
columns*{color} in dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the option nullValue in depended component univocity's 
{*}_CommonSettings_{*}, is designed as that:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string.
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string.{noformat}
{*}There is a difference when reading{*}. In univocity, nothing content will be 
convert to nullValue strings. But In Spark, we finally convert nothing content 
or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
 

From now, we start to talk about emptyValue.

{*}For the emptyValue option{*},  we add a emptyValueInRead option for reading 
and a emptyValueInWrite option for writing. I found that both Spark keeps the 
same behaviors for emptyValue with univocity.
{noformat}
When reading, if the parser does not read any character from the input, and the 
input is within quotes, the empty is used instead of an empty string.

When writing, if the writer has an empty String to write to the output, the 
emptyValue is used instead of an empty string.{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", 
{code}
{color:#910091}""{color}
{code:scala}
)).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
When reading:
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|EMPTY|

We can find that empty columns in dataframe can be saved as "EMPTY" strings in 
csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed 
as empty columns{color}* in dataframe. That is:
{noformat}
When writing, convert "" empty(in dataframe) to emptyValue(in csv)
When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
{noformat}
 

There is obvious difference between nullValue and emptyValue in read handling. 
For nullValue, we try to convert nothing or nullValue strings to null in 
dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty 
strings) to emptyValue rather than to convert both "\"\""(quoted empty strings) 
and emptyValue strings to ""(empty) in dataframe.

I think it's better that we keep the similar behavior(try to recover emptyValue 
to "") for emptyValue as nullValue when reading, so I suggest that the 
emptyValueInRead(in CSVOptions) should  be designed as that any fields matching 
this string will be set as empty values "" when reading.

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|null|

We can find that null columns in dataframe can be saved as "NULL" strings in 
csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as null 
columns*{color} in dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the option nullValue in depended component univocity's 
{*}_CommonSettings_{*}, is designed as that:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string.
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string.{noformat}
{*}There is a difference when reading{*}. In univocity, nothing content will be 
convert to nullValue strings. But In Spark, we finally convert nothing content 
or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
 

{*}For the emptyValue option{*},  we add a emptyValueInRead option for reading 
and a emptyValueInWrite option for writing. I found that both Spark keeps the 
same behaviors for emptyValue with univocity.
{noformat}
When reading, if the parser does not read any character from the input, and the 
input is within quotes, the empty is used instead of an empty string.

When writing, if the writer has an empty String to write to the output, the 
emptyValue is used instead of an empty string.{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", 
{code}
{color:#910091}""{color}
{code:scala}
)).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
When reading:
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|EMPTY|

We can find that empty columns in dataframe can be saved as "EMPTY" strings in 
csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed 
as empty columns{color}* in dataframe. That is:
{noformat}
When writing, convert "" empty(in dataframe) to emptyValue(in csv)
When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
{noformat}
 

There is obvious difference between nullValue and emptyValue in read handling. 
For nullValue, we try to convert nothing or nullValue strings to null in 
dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty 
strings) to emptyValue rather than to convert both "\"\""(quoted empty strings) 
and emptyValue strings to ""(empty) in dataframe.

I think it's better that we keep the similar behavior(try to recover emptyValue 
to "") for emptyValue as nullValue when reading, so I suggest that the 
emptyValueInRead(in CSVOptions) should  be designed as that any fields matching 
this string will be set as empty values "" when reading.

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null 

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|null|

We can find that null columns in dataframe can be saved as "NULL" strings in 
csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as null 
columns*{color} in dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the option nullValue in depended component univocity's 
{*}_CommonSettings_{*}, is designed as that:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string.
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string.{noformat}
{*}There is a difference when reading{*}. In univocity, nothing content would 
be convert to nullValue strings. But In Spark, we finally convert nothing 
content or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* 
method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
 



 

{*}For the emptyValue option{*},  we add a emptyValueInRead option for reading 
and a emptyValueInWrite option for writing. I found that both Spark keeps the 
same behaviors for emptyValue with univocity.
{noformat}
When reading, if the parser does not read any character from the input, and the 
input is within quotes, the empty is used instead of an empty string.

When writing, if the writer has an empty String to write to the output, the 
emptyValue is used instead of an empty string.{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", 
{code}
{color:#910091}""{color}
{code:scala}
)).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
When reading:
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|EMPTY|

We can find that empty columns in dataframe can be saved as "EMPTY" strings in 
csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed 
as empty columns{color}* in dataframe. That is:
{noformat}
When writing, convert "" empty(in dataframe) to emptyValue(in csv)
When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
{noformat}
 

There is obvious difference between nullValue and emptyValue in read handling. 
For nullValue, we try to convert nothing or nullValue strings to null in 
dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty 
strings) to emptyValue rather than to convert both "\"\""(quoted empty strings) 
and emptyValue strings to ""(empty) in dataframe.

I think it's better that we keep the similar behavior(try to recover emptyValue 
to "") for emptyValue as nullValue when reading, so I suggest that the 
emptyValueInRead(in CSVOptions) should  be designed as that any fields matching 
this string will be set as empty values "" when reading.

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string 

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|null|

We can find that null columns in dataframe can be saved as "NULL" strings in 
csv files and {color:#00875a}*"NULL" strings in csv files can be parsed as null 
columns*{color} in dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the option nullValue in depended component univocity's 
{*}_CommonSettings_{*}, is designed as that:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string.
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string.{noformat}
There is a difference when reading. In univocity, nothing content would be 
convert to nullValue strings. But In Spark, we finally convert nothing content 
or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
 



 

{*}For the emptyValue option{*},  we add a emptyValueInRead option for reading 
and a emptyValueInWrite option for writing. I found that both Spark keeps the 
same behaviors for emptyValue with univocity.
{noformat}
When reading, if the parser does not read any character from the input, and the 
input is within quotes, the empty is used instead of an empty string.

When writing, if the writer has an empty String to write to the output, the 
emptyValue is used instead of an empty string.{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", 
{code}
{color:#910091}""{color}
{code:scala}
)).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
When reading:
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|EMPTY|

We can find that empty columns in dataframe can be saved as "EMPTY" strings in 
csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed 
as empty columns{color}* in dataframe. That is:
{noformat}
When writing, convert "" empty(in dataframe) to emptyValue(in csv)
When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
{noformat}
 

There is obvious difference between nullValue and emptyValue in read handling. 
For nullValue, we try to convert nothing or nullValue strings to null in 
dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty 
strings) to emptyValue rather than to convert both "\"\""(quoted empty strings) 
and emptyValue strings to ""(empty) in dataframe.

I think it's better that we keep the similar behavior(try to recover emptyValue 
to "") for emptyValue as nullValue when reading, so I suggest that the 
emptyValueInRead(in CSVOptions) should  be designed as that any fields matching 
this string will be set as empty values "" when reading.

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that 

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|null|

We can find that null columns in dataframe can be saved as "NULL" strings in 
csv files and *{color:#de350b}"NULL" strings in csv files can be parsed as null 
columns{color}* in dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the option nullValue in depended component univocity's 
{*}_CommonSettings_{*}, is designed as that:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string.
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string.{noformat}
There is a difference when reading. In univocity, nothing content would be 
convert to nullValue strings. But In Spark, we finally convert nothing content 
or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
 



 

{*}For the emptyValue option{*},  we add a emptyValueInRead option for reading 
and a emptyValueInWrite option for writing. I found that both Spark keeps the 
same behaviors for emptyValue with univocity.
{noformat}
When reading, if the parser does not read any character from the input, and the 
input is within quotes, the empty is used instead of an empty string.

When writing, if the writer has an empty String to write to the output, the 
emptyValue is used instead of an empty string.{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", 
{code}
{color:#910091}""{color}
{code:scala}
)).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
When reading:
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|Tesla|EMPTY|

We can find that empty columns in dataframe can be saved as "EMPTY" strings in 
csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed 
as empty columns{color}* in dataframe. That is:
{noformat}
When writing, convert "" empty(in dataframe) to emptyValue(in csv)
When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
{noformat}
 

There is obvious difference between nullValue and emptyValue in read handling. 
For nullValue, we try to convert nothing or nullValue strings to null in 
dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty 
strings) to emptyValue rather than to convert both "\"\""(quoted empty strings) 
and emptyValue strings to ""(empty) in dataframe.

I think it's better that we keep the similar behavior(try to recover emptyValue 
to "") for emptyValue as nullValue when reading, so I suggest that the 
emptyValueInRead(in CSVOptions) should  be designed as that any fields matching 
this string will be set as empty values "" when reading.

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that 

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features described in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|tesla|null|

We can find that null columns in dataframe can be saved as "NULL" strings in 
csv files and *{color:#de350b}"NULL" strings in csv files can be parsed as null 
columns{color}* in dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the option nullValue in depended component univocity's 
{*}_CommonSettings_{*}, is designed as that:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string.
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string.{noformat}
There is a difference when reading. In univocity, nothing content would be 
convert to nullValue strings. But In Spark, we finally convert nothing content 
or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
 



 

{*}For the emptyValue option{*},  we add a emptyValueInRead option for reading 
and a emptyValueInWrite option for writing. I found that both Spark keeps the 
same behaviors for emptyValue with univocity.
{noformat}
When reading, if the parser does not read any character from the input, and the 
input is within quotes, the empty is used instead of an empty string.

When writing, if the writer has an empty String to write to the output, the 
emptyValue is used instead of an empty string.{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", 
{code}
{color:#910091}""{color}
{code:scala}
)).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
When reading:
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|tesla|EMPTY|

We can find that empty columns in dataframe can be saved as "EMPTY" strings in 
csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed 
as empty columns{color}* in dataframe. That is:
{noformat}
When writing, convert "" empty(in dataframe) to emptyValue(in csv)
When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
{noformat}
 

There is obvious difference between nullValue and emptyValue in read handling. 
For nullValue, we try to convert nothing or nullValue strings to null in 
dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty 
strings) to emptyValue rather than to convert both "\"\""(quoted empty strings) 
and emptyValue strings to ""(empty) in dataframe.

I think it's better that we keep the similar behavior(try to recover emptyValue 
to "") for emptyValue as nullValue when reading, so I suggest that the 
emptyValueInRead(in CSVOptions) should  be designed as that any fields matching 
this string will be set as empty values "" when reading.

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features description in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that 

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to the features description in 
spark-csv readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|tesla|null|

We can find that null columns in dataframe can be saved as "NULL" strings in 
csv files and *{color:#de350b}"NULL" strings in csv files can be parsed as null 
columns{color}* in dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the option nullValue in depended component univocity's 
{*}_CommonSettings_{*}, is designed as that:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string.
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string.{noformat}
There is a difference when reading. In univocity, nothing content would be 
convert to nullValue strings. But In Spark, we finally convert nothing content 
or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
 



 

{*}For the emptyValue option{*},  we add a emptyValueInRead option for reading 
and a emptyValueInWrite option for writing. I found that both Spark keeps the 
same behaviors for emptyValue with univocity.
{noformat}
When reading, if the parser does not read any character from the input, and the 
input is within quotes, the empty is used instead of an empty string.

When writing, if the writer has an empty String to write to the output, the 
emptyValue is used instead of an empty string.{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", 
{code}
{color:#910091}""{color}
{code:scala}
)).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
When reading:
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|tesla|EMPTY|

We can find that empty columns in dataframe can be saved as "EMPTY" strings in 
csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed 
as empty columns{color}* in dataframe. That is:
{noformat}
When writing, convert "" empty(in dataframe) to emptyValue(in csv)
When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
{noformat}
 

There is obvious difference between nullValue and emptyValue in read handling. 
For nullValue, we try to convert nothing or nullValue strings to null in 
dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty 
strings) to emptyValue rather than to convert both "\"\""(quoted empty strings) 
and emptyValue strings to ""(empty) in dataframe.

I think it's better that we keep the similar behavior(try to recover emptyValue 
to "") for emptyValue as nullValue when reading, so I suggest that the 
emptyValueInRead(in CSVOptions) should  be designed as that any fields matching 
this string will be set as empty values "" when reading.

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to the features description in 
spark-csv readme file, it is designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a 

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to features description in spark-csv 
readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|tesla|null|

We can find that null columns in dataframe can be saved as "NULL" strings in 
csv files and *{color:#de350b}"NULL" strings in csv files can be parsed as null 
columns{color}* in dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the option nullValue in depended component univocity's 
{*}_CommonSettings_{*}, is designed as that:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string.
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string.{noformat}
There is a difference when reading. In univocity, nothing content would be 
convert to nullValue strings. But In Spark, we finally convert nothing content 
or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
 



 

{*}For the emptyValue option{*},  we add a emptyValueInRead option for reading 
and a emptyValueInWrite option for writing. I found that both Spark keeps the 
same behaviors for emptyValue with univocity.
{noformat}
When reading, if the parser does not read any character from the input, and the 
input is within quotes, the empty is used instead of an empty string.

When writing, if the writer has an empty String to write to the output, the 
emptyValue is used instead of an empty string.{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", 
{code}
{color:#910091}""{color}
{code:scala}
)).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
When reading:
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|tesla|EMPTY|

We can find that empty columns in dataframe can be saved as "EMPTY" strings in 
csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed 
as empty columns{color}* in dataframe. That is:
{noformat}
When writing, convert "" empty(in dataframe) to emptyValue(in csv)
When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
{noformat}
 

There is obvious difference between nullValue and emptyValue in read handling. 
For nullValue, we try to convert nothing or nullValue strings to null in 
dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty 
strings) to emptyValue rather than to convert both "\"\""(quoted empty strings) 
and emptyValue strings to ""(empty) in dataframe.

I think it's better that we keep the similar behavior(try to recover emptyValue 
to "") for emptyValue as nullValue when reading, so I suggest that the 
emptyValueInRead(in CSVOptions) should  be designed as that any fields matching 
this string will be set as empty values "" when reading.

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to the features description in 
spark-csv readme file, it's designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string 

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR 
[10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to the features description in 
spark-csv readme file, it is designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|tesla|null|

We can find that null columns in dataframe can be saved as "NULL" strings in 
csv files and *{color:#de350b}"NULL" strings in csv files can be parsed as null 
columns{color}* in dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the option nullValue in depended component univocity's 
{*}_CommonSettings_{*}, is designed as that:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string.
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string.{noformat}
There is a difference when reading. In univocity, nothing content would be 
convert to nullValue strings. But In Spark, we finally convert nothing content 
or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
 



 

{*}For the emptyValue option{*},  we add a emptyValueInRead option for reading 
and a emptyValueInWrite option for writing. I found that both Spark keeps the 
same behaviors for emptyValue with univocity.
{noformat}
When reading, if the parser does not read any character from the input, and the 
input is within quotes, the empty is used instead of an empty string.

When writing, if the writer has an empty String to write to the output, the 
emptyValue is used instead of an empty string.{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", 
{code}
{color:#910091}""{color}
{code:scala}
)).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
When reading:
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|tesla|EMPTY|

We can find that empty columns in dataframe can be saved as "EMPTY" strings in 
csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed 
as empty columns{color}* in dataframe. That is:
{noformat}
When writing, convert "" empty(in dataframe) to emptyValue(in csv)
When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
{noformat}
 

There is obvious difference between nullValue and emptyValue in read handling. 
For nullValue, we try to convert nothing or nullValue strings to null in 
dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty 
strings) to emptyValue rather than to convert both "\"\""(quoted empty strings) 
and emptyValue strings to ""(empty) in dataframe.

I think it's better that we keep the similar behavior(try to recover emptyValue 
to "") for emptyValue as nullValue when reading, so I suggest that the 
emptyValueInRead(in CSVOptions) should  be designed as that any fields matching 
this string will be set as empty values "" when reading.

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR 
[10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to the features description in 
spark-csv readme file, it is designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a 

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 with 
PR [10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to the features description in 
spark-csv readme file, it is designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|tesla|null|

We can find that null columns in dataframe can be saved as "NULL" strings in 
csv files and *{color:#de350b}"NULL" strings in csv files can be parsed as null 
columns{color}* in dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the option nullValue in depended component univocity's 
{*}_CommonSettings_{*}, is designed as that:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string.
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string.{noformat}
There is a difference when reading. In univocity, nothing content would be 
convert to nullValue strings. But In Spark, we finally convert nothing content 
or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
 



 

{*}For the emptyValue option{*},  we add a emptyValueInRead option for reading 
and a emptyValueInWrite option for writing. I found that both Spark keeps the 
same behaviors for emptyValue with univocity.
{noformat}
When reading, if the parser does not read any character from the input, and the 
input is within quotes, the empty is used instead of an empty string.

When writing, if the writer has an empty String to write to the output, the 
emptyValue is used instead of an empty string.{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", 
{code}
{color:#910091}""{color}
{code:scala}
)).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
When reading:
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|tesla|EMPTY|

We can find that empty columns in dataframe can be saved as "EMPTY" strings in 
csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed 
as empty columns{color}* in dataframe. That is:
{noformat}
When writing, convert "" empty(in dataframe) to emptyValue(in csv)
When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
{noformat}
 

There is obvious difference between nullValue and emptyValue in read handling. 
For nullValue, we try to convert nothing or nullValue strings to null in 
dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty 
strings) to emptyValue rather than to convert both "\"\""(quoted empty strings) 
and emptyValue strings to ""(empty) in dataframe.

I think it's better that we keep the similar behavior(try to recover emptyValue 
to "") for emptyValue as nullValue when reading, so I suggest that the 
emptyValueInRead(in CSVOptions) should  be designed as that any fields matching 
this string will be set as empty values "" when reading.

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR 
[10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to the features description in 
spark-csv readme file, it is designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a 

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR 
[10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to the features description in 
spark-csv readme file, it is designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|tesla|null|

We can find that null columns in dataframe can be saved as "NULL" strings in 
csv files and *{color:#de350b}"NULL" strings in csv files can be parsed as null 
columns{color}* in dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the option nullValue in depended component univocity's 
{*}_CommonSettings_{*}, is designed as that:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string.
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string.{noformat}
There is a difference when reading. In univocity, nothing content would be 
convert to nullValue strings. But In Spark, we finally convert nothing content 
or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
 



 

{*}For the emptyValue option{*},  we add a emptyValueInRead option for reading 
and a emptyValueInWrite option for writing. I found that both Spark keeps the 
same behaviors for emptyValue with univocity.
{noformat}
When reading, if the parser does not read any character from the input, and the 
input is within quotes, the empty is used instead of an empty string.

When writing, if the writer has an empty String to write to the output, the 
emptyValue is used instead of an empty string.{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", 
{code}
{color:#910091}""{color}
{code:scala}
)).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
When reading:
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|tesla|EMPTY|

We can find that empty columns in dataframe can be saved as "EMPTY" strings in 
csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed 
as empty columns{color}* in dataframe. That is:
{noformat}
When writing, convert "" empty(in dataframe) to emptyValue(in csv)
When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
{noformat}
 

There is obvious difference between nullValue and emptyValue in read handling. 
For nullValue, we try to convert nothing or nullValue strings to null in 
dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty 
strings) to emptyValue rather than to convert both "\"\""(quoted empty strings) 
and emptyValue strings to ""(empty) in dataframe.

I think it's better that we keep the similar behavior(try to recover emptyValue 
to "") for emptyValue as nullValue when reading, so 

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR 
[10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to the features description in 
spark-csv readme file, it is designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", 

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR 
[10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to the features description in 
spark-csv readme file, it is designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|tesla|null|

We can find that null columns in dataframe can be saved as "NULL" strings in 
csv files and *{color:#de350b}"NULL" strings in csv files can be parsed as null 
columns{color}* in dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the option nullValue in depended component univocity's 
{*}_CommonSettings_{*}, is designed as that:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string.
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string.{noformat}
There is a difference when reading. In univocity, nothing content would be 
convert to nullValue strings. But In Spark, we finally convert nothing content 
or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
 



 

{*}For the emptyValue option{*},  we add a emptyValueInRead option for reading 
and a emptyValueInWrite option for writing. I found that both Spark keeps the 
same behaviors for emptyValue with univocity.
{noformat}
When reading, if the parser does not read any character from the input, and the 
input is within quotes, the empty is used instead of an empty string.

When writing, if the writer has an empty String to write to the output, the 
emptyValue is used instead of an empty string.{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", 
{code}
{color:#910091}""{color}
{code:scala}
)).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
When reading:
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|tesla|EMPTY|

We can find that empty columns in dataframe can be saved as "EMPTY" strings in 
csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed 
as empty columns{color}* in dataframe. That is:
{noformat}
When writing, convert "" empty(in dataframe) to emptyValue(in csv)
When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
{noformat}
 

There is obvious difference between nullValue and emptyValue in read handling. 
For nullValue, we try to convert nothing or nullValue strings to null in 
dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty 
strings) to emptyValue rather than to convert both "\"\""(quoted empty strings) 
and emptyValue strings to ""(empty) in dataframe.

I think it's better to keep the same read behavior for emptyValue as for 
nullValue.

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR 
[10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to the features description in 
spark-csv readme file, it is designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", 

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR 
[10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to the features description in 
spark-csv readme file, it is designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|tesla|null|

We can find that null columns in dataframe can be saved as "NULL" strings in 
csv files and *{color:#de350b}"NULL" strings in csv files can be parsed as null 
columns{color}* in dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the option nullValue in depended component univocity's 
{*}_CommonSettings_{*}, is designed as that:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string.
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string.{noformat}
There is a difference when reading. In univocity, nothing content would be 
convert to nullValue strings. But In Spark, we finally convert nothing content 
or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
 



 

{*}For the emptyValue option{*},  we add a emptyValueInRead option for reading 
and a emptyValueInWrite option for writing. I found that both Spark keeps the 
same behaviors for emptyValue with univocity.
{noformat}
When reading, if the parser does not read any character from the input, and the 
input is within quotes, the empty is used instead of an empty string.

When writing, if the writer has an empty String to write to the output, the 
emptyValue is used instead of an empty string.{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", 
{code}
{color:#910091}""{color}
{code:scala}
)).toDF("make", "comment").write.option("emptyValue", "EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
When reading:
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|tesla|EMPTY|

We can find that empty columns in dataframe can be saved as "EMPTY" strings in 
csv files, *{color:#de350b}but "EMPTY" strings in csv files can not be parsed 
as empty columns{color}* in dataframe. That is:
{noformat}
When writing, convert "" empty(in dataframe) to emptyValue(in csv)
When reading, convert "\"\"" quoted empty strings to emptyValue(in dataframe)
{noformat}
 

There is obvious difference between nullValue and emptyValue in read handling. 
For nullValue, we try to convert nothing or nullValue strings to null in 
dataframe, but for emptyValue, we just try to convert "\"\""(quoted empty 
strings) to emptyValue rather than to convert both "\"\""(quoted empty strings) 
and emptyValue strings to ""(empty) in dataframe.

I think 

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR 
[10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to the features description in 
spark-csv readme file, it is designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR 
[10766|https://github.com/apache/spark/pull/10766] .

{*}For the nullValue option{*}, according to the features description in 
spark-csv readme file, it is designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|tesla|null|

We can find that null columns in dataframe can be saved as NULL strings in csv 
files and NULL strings in csv files can be parsed as columns of null values in 
dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the option nullValue in depended component univocity's 
{*}_CommonSettings_{*}, is designed as that:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string.
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string.{noformat}
There is a difference when reading. In univocity, nothing content would be 
convert to nullValue strings. But In Spark, we finally convert nothing content 
or nullValue strings to null in *_UnivocityParser_ _nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
 



 

{*}For the emptyValue option{*},  we add a emptyValueInRead option for reading 
and a emptyValueInWrite option for writing. I found that both Spark keeps the 
same behaviors for emptyValue with univocity.
{noformat}
When reading, if the parser does not read any character from the input, and the 
input is within quotes, the empty is used instead of an empty string.

When writing, if the writer has an empty String to write to the output, the 
emptyValue is used instead of an empty string.{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|tesla|null|

We can find that null columns in dataframe can be saved as NULL strings in csv 
files and NULL strings in csv files can be parsed as columns of null values in 
dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
 

Since Spark 2.4, for empty strings, there are  emptyValueInRead for reading and 
emptyValueInWrite for writing that can be set in CSVOptions:
{code:scala}
// For writing, convert: ""(dataframe) => emptyValueInWrite(csv)

// For reading, convert: "" (csv) => emptyValueInRead(dataframe){code}
I think the read handling is not suitable: we cannot convert "" or 
`{color:#172b4d}emptyValueInWrite`{color} values back to "" (real empty strings); 
we actually get {color:#172b4d}emptyValueInRead's setting value{color}. It is 
supposed to be as follows:
{code:scala}
// For reading, convert: "" or emptyValueInRead (csv) => ""(dataframe){code}
 
{color:#de350b}*We cannot recover the original DataFrame.*{color}

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR 
[10766|https://github.com/apache/spark/pull/10766] .

For the nullValue option, according to the features description in spark-csv 
readme file, it is designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR 
[10766|https://github.com/apache/spark/pull/10766] .

For the nullValue option, according to the features description in spark-csv 
readme file, it is designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|tesla|null|

We can find that null columns in dataframe can be saved as NULL strings in csv 
files and NULL strings in csv files can be parsed as columns of null values in 
dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the option nullValue in depended component univocity's 
{*}_CommonSettings_{*}, is designed as that:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string{noformat}
There is a difference when reading. In univocity, nothing would be convert to 
nullValue strings. But In Spark, we finally convert nothing or nullValue 
strings to null in *_UnivocityParser_ _nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
 

For the emptyValue option,  we add a emptyValueInRead option for reading and a 
emptyValueInWrite option for writing.
For example,  a column has empty strings, if emptyValueInWrite is set to 
"EMPTY" string.
{code:scala}
Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
"EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
and if we read this csv file with emptyValue(emptyValueInRead) set to "EMPTY" 
string.
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
we actually get the DataFrame which data is shown as:
||make||comment||
|tesla|EMPTY|

but the DataFrame which data should be shown as below as  expected:
||make||comment||
|tesla| |

I found that Spark keeps the same behavior with the depended component 
univocity.

Since Spark 2.4, for empty strings, there are  emptyValueInRead for reading and 
emptyValueInWrite for writing that can be set in CSVOptions:
{code:scala}
// For writing, convert: ""(dataframe) => emptyValueInWrite(csv)

// For reading, convert: "" (csv) => emptyValueInRead(dataframe){code}
I think the read handling is not suitable, we can not convert "" or 
`{color:#172b4d}emptyValueInWrite`{color} values as ""(real empty strings) but 
get {color:#172b4d}emptyValueInRead's setting value actually{color}, it 
supposed to be as flows:
{code:scala}
// For reading, convert: "" or emptyValueInRead (csv) => ""(dataframe){code}
 
{color:#de350b}*We can not  recovery it to the original DataFrame.*{color}

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR 
[10766|https://github.com/apache/spark/pull/10766] .

For the nullValue option, according to the features description in spark-csv 
readme file, it is designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is 

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR 
[10766|https://github.com/apache/spark/pull/10766] .

For the nullValue option, according to the features description in spark-csv 
readme file, it is designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|tesla|null|

We can find that null columns in dataframe can be saved as NULL strings in csv 
files and NULL strings in csv files can be parsed as columns of null values in 
dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the option nullValue in component univocity's 
{*}_CommonSettings_{*}, is designed as that:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string{noformat}
There is a difference when reading. In univocity, nothing would be convert to 
nullValue strings. But In Spark, we finally convert nothing or nullValue 
strings to null in *_UnivocityParser_ _nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
 

For the emptyValue option,  we add a emptyValueInRead option for reading and a 
emptyValueInWrite option for writing.

I found that 

Since Spark 2.4, for empty strings, there are  emptyValueInRead for reading and 
emptyValueInWrite for writing that can be set in CSVOptions:
{code:scala}
// For writing, convert: ""(dataframe) => emptyValueInWrite(csv)

// For reading, convert: "" (csv) => emptyValueInRead(dataframe){code}
I think the read handling is not suitable, we can not convert "" or 
`{color:#172b4d}emptyValueInWrite`{color} values as ""(real empty strings) but 
get {color:#172b4d}emptyValueInRead's setting value actually{color}, it 
supposed to be as flows:
{code:scala}
// For reading, convert: "" or emptyValueInRead (csv) => ""(dataframe){code}
For example,  a column has empty strings, if emptyValueInWrite is set to 
"EMPTY" string.
{code:scala}
Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
"EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
and if we read this csv file with emptyValue(emptyValueInRead) set to "EMPTY" 
string.
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
we actually get the DataFrame which data is shown as:
||make||comment||
|tesla|EMPTY|

but the DataFrame which data should be shown as below as  expected:
||make||comment||
|tesla| |

{color:#de350b}*We can not  recovery it to the original DataFrame.*{color}

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR 
[10766|https://github.com/apache/spark/pull/10766] .

For the nullValue option, according to the features description in spark-csv 
readme file, it is designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|tesla|null|

We can find that null columns in dataframe can be saved as NULL strings in csv 
files and NULL 

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR 
[10766|https://github.com/apache/spark/pull/10766] .

For the nullValue option, according to the features description in spark-csv 
readme file, it is designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|tesla|null|

We can find that null columns in dataframe can be saved as NULL strings in csv 
files and NULL strings in csv files can be parsed as columns of null values in 
dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the option nullValue in component univocity's 
{*}_CommonSettings_{*}, is designed as that:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string{noformat}
There is a difference when reading. In univocity, nothing would be convert to 
nullValue strings. But In Spark, we finally convert nothing or nullValue 
strings to null in *_UnivocityParser_ _nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
 

 

Since Spark 2.4, for empty strings, there are  emptyValueInRead for reading and 
emptyValueInWrite for writing that can be set in CSVOptions:
{code:scala}
// For writing, convert: ""(dataframe) => emptyValueInWrite(csv)

// For reading, convert: "" (csv) => emptyValueInRead(dataframe){code}
I think the read handling is not suitable, we can not convert "" or 
`{color:#172b4d}emptyValueInWrite`{color} values as ""(real empty strings) but 
get {color:#172b4d}emptyValueInRead's setting value actually{color}, it 
supposed to be as flows:
{code:scala}
// For reading, convert: "" or emptyValueInRead (csv) => ""(dataframe){code}
For example,  a column has empty strings, if emptyValueInWrite is set to 
"EMPTY" string.
{code:scala}
Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
"EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
and if we read this csv file with emptyValue(emptyValueInRead) set to "EMPTY" 
string.
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
we actually get the DataFrame which data is shown as:
||make||comment||
|tesla|EMPTY|

but the DataFrame which data should be shown as below as  expected:
||make||comment||
|tesla| |

{color:#de350b}*We can not  recovery it to the original DataFrame.*{color}

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR 
[10766|https://github.com/apache/spark/pull/10766] .

According to the features description in spark-csv readme file, the nullValue 
option is designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|tesla|null|

We can find that null columns in dataframe can be saved as NULL strings in csv 
files and NULL strings in csv files can be parsed as columns of null values in 
dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to 

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue SPARK-12833 and PR 
[10766|https://github.com/apache/spark/pull/10766] .

According to the features description in spark-csv readme file, the nullValue 
option is designed as:
{noformat}
When reading files:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}
For example, when writing:
{code:scala}
Seq(("Tesla", null:String)).toDF("make", "comment").write.option("nullValue", 
"NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
When reading:
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
The parsed dataframe is shown as:
||make||comment||
|tesla|null|

We can find that null columns in dataframe can be saved as NULL strings in csv 
files and NULL strings in csv files can be parsed as columns of null values in 
dataframe. That is:
{noformat}
When writing, convert null(in dataframe) to nullValue(in csv)
When reading, convert nullValue or nothing(in csv) to null(in dataframe)
{noformat}
But actually, the option nullValue in component univocity's 
{*}_CommonSettings_{*}, is designed as that:
{noformat}
when reading, if the parser does not read any character from the input, the 
nullValue is used instead of an empty string
when writing, if the writer has a null object to write to the output, the 
nullValue is used instead of an empty string{noformat}
There is a difference when reading. In univocity, nothing would be convert to 
nullValue strings. But In Spark, we finally convert nothing or nullValue 
strings to null in *_UnivocityParser_ _nullSafeDatum_* method:
{code:java}
private def nullSafeDatum(
 datum: String,
 name: String,
 nullable: Boolean,
 options: CSVOptions)(converter: ValueConverter): Any = {
  if (datum == options.nullValue || datum == null) {
if (!nullable) {
  throw QueryExecutionErrors.foundNullValueForNotNullableFieldError(name)
}
null
  } else {
converter.apply(datum)
  }
} {code}
 

 

Since Spark 2.4, for empty strings, there are  emptyValueInRead for reading and 
emptyValueInWrite for writing that can be set in CSVOptions:
{code:scala}
// For writing, convert: ""(dataframe) => emptyValueInWrite(csv)

// For reading, convert: "" (csv) => emptyValueInRead(dataframe){code}
I think the read handling is not suitable, we can not convert "" or 
`{color:#172b4d}emptyValueInWrite`{color} values as ""(real empty strings) but 
get {color:#172b4d}emptyValueInRead's setting value actually{color}, it 
supposed to be as flows:
{code:scala}
// For reading, convert: "" or emptyValueInRead (csv) => ""(dataframe){code}
For example,  a column has empty strings, if emptyValueInWrite is set to 
"EMPTY" string.
{code:scala}
Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
"EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
and if we read this csv file with emptyValue(emptyValueInRead) set to "EMPTY" 
string.
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
we actually get the DataFrame which data is shown as:
||make||comment||
|tesla|EMPTY|

but the DataFrame which data should be shown as below as  expected:
||make||comment||
|tesla| |

{color:#de350b}*We can not  recovery it to the original DataFrame.*{color}

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue 
[SPARK-12833|https://issues.apache.org/jira/browse/SPARK-12833] and PR 
[10766|https://github.com/apache/spark/pull/10766] .

According to databricks spark-csv's features description in readme file, the 
nullValue option is designed as:
{noformat}
When reading files the API accepts several options:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files the API accepts several options:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}


For null values, the parameter nullValue can be set when reading or writing  in 
CSVOptions:
{code:scala}
// For writing, convert: null(dataframe) => nullValue(csv)

// For reading, convert: nullValue or ,,(csv) => null(dataframe)
{code}
For  example, a column has null values, if nullValue is set to "null" string.
{code:scala}
Seq(("Tesla", null.asInstanceOf[String])).toDF("make", 
"comment").write.option("nullValue", "NULL").csv(path){code}
The saved csv file is shown as:

[jira] [Updated] (SPARK-37604) The option emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue 
[SPARK-12833|https://issues.apache.org/jira/browse/SPARK-12833] and PR 
[10766|https://github.com/apache/spark/pull/10766] .

According to databricks spark-csv's features description in readme file, the 
nullValue option is designed as:
{noformat}
When reading files the API accepts several options:
nullValue: specifies a string that indicates a null value, any fields matching 
this string will be set as nulls in the DataFrame

When writing files the API accepts several options:
nullValue: specifies a string that indicates a null value, nulls in the 
DataFrame will be written as this string.
{noformat}


For null values, the parameter nullValue can be set when reading or writing  in 
CSVOptions:
{code:scala}
// For writing, convert: null(dataframe) => nullValue(csv)

// For reading, convert: nullValue or ,,(csv) => null(dataframe)
{code}
For  example, a column has null values, if nullValue is set to "null" string.
{code:scala}
Seq(("Tesla", null.asInstanceOf[String])).toDF("make", 
"comment").write.option("nullValue", "NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
and if we read this csv file with nullValue set to "null" string.
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
we can get the DataFrame which data is same with the original shown as:
||make||comment||
|tesla|null|

{color:#57d9a3}*We can succeed to recovery it to the original DataFrame.*{color}

 

Since Spark 2.4, for empty strings, there are  emptyValueInRead for reading and 
emptyValueInWrite for writing that can be set in CSVOptions:
{code:scala}
// For writing, convert: ""(dataframe) => emptyValueInWrite(csv)

// For reading, convert: "" (csv) => emptyValueInRead(dataframe){code}
I think the read handling is not suitable, we can not convert "" or 
`{color:#172b4d}emptyValueInWrite`{color} values as ""(real empty strings) but 
get {color:#172b4d}emptyValueInRead's setting value actually{color}, it 
supposed to be as flows:
{code:scala}
// For reading, convert: "" or emptyValueInRead (csv) => ""(dataframe){code}
For example,  a column has empty strings, if emptyValueInWrite is set to 
"EMPTY" string.
{code:scala}
Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
"EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
and if we read this csv file with emptyValue(emptyValueInRead) set to "EMPTY" 
string.
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
we actually get the DataFrame which data is shown as:
||make||comment||
|tesla|EMPTY|

but the DataFrame which data should be shown as below as  expected:
||make||comment||
|tesla| |

{color:#de350b}*We can not  recovery it to the original DataFrame.*{color}

  was:
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue 
[SPARK-12833|https://issues.apache.org/jira/browse/SPARK-12833] and PR 
[10766|https://github.com/apache/spark/pull/10766] .

In databricks spark-csv, 

For null values, the parameter nullValue can be set when reading or writing  in 
CSVOptions:
{code:scala}
// For writing, convert: null(dataframe) => nullValue(csv)

// For reading, convert: nullValue or ,,(csv) => null(dataframe)
{code}
For  example, a column has null values, if nullValue is set to "null" string.
{code:scala}
Seq(("Tesla", null.asInstanceOf[String])).toDF("make", 
"comment").write.option("nullValue", "NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
and if we read this csv file with nullValue set to "null" string.
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
we can get the DataFrame which data is same with the original shown as:
||make||comment||
|tesla|null|

{color:#57d9a3}*We can succeed to recovery it to the original DataFrame.*{color}

 

Since Spark 2.4, for empty strings, there are  emptyValueInRead for reading and 
emptyValueInWrite for writing that can be set in CSVOptions:
{code:scala}
// For writing, convert: ""(dataframe) => emptyValueInWrite(csv)

// For reading, convert: "" (csv) => emptyValueInRead(dataframe){code}
I think the read handling is not suitable, we can not convert "" or 
`{color:#172b4d}emptyValueInWrite`{color} values as ""(real empty strings) but 
get {color:#172b4d}emptyValueInRead's setting value actually{color}, it 
supposed to be as flows:
{code:scala}
// For reading, convert: "" or emptyValueInRead (csv) => ""(dataframe){code}
For example,  a column has empty strings, if emptyValueInWrite is set to 
"EMPTY" string.
{code:scala}
Seq(("Tesla", "")).toDF("make", 

[jira] [Updated] (SPARK-37604) The parameter emptyValueInRead(in CSVOptions) is suggested to be designed as that any fields matching this string will be set as empty values "" when reading

2021-12-15 Thread Wei Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Guo updated SPARK-37604:

Description: 
The csv data format is imported from databricks 
[spark-csv|https://github.com/databricks/spark-csv] by issue 
[SPARK-12833|https://issues.apache.org/jira/browse/SPARK-12833] and PR 
[10766|https://github.com/apache/spark/pull/10766] .

In databricks spark-csv, 

For null values, the parameter nullValue can be set when reading or writing  in 
CSVOptions:
{code:scala}
// For writing, convert: null(dataframe) => nullValue(csv)

// For reading, convert: nullValue or ,,(csv) => null(dataframe)
{code}
For  example, a column has null values, if nullValue is set to "null" string.
{code:scala}
Seq(("Tesla", null.asInstanceOf[String])).toDF("make", 
"comment").write.option("nullValue", "NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
and if we read this csv file with nullValue set to "null" string.
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
we can get the DataFrame which data is same with the original shown as:
||make||comment||
|tesla|null|

{color:#57d9a3}*We can succeed to recovery it to the original DataFrame.*{color}

 

Since Spark 2.4, for empty strings, there are  emptyValueInRead for reading and 
emptyValueInWrite for writing that can be set in CSVOptions:
{code:scala}
// For writing, convert: ""(dataframe) => emptyValueInWrite(csv)

// For reading, convert: "" (csv) => emptyValueInRead(dataframe){code}
I think the read handling is not suitable, we can not convert "" or 
`{color:#172b4d}emptyValueInWrite`{color} values as ""(real empty strings) but 
get {color:#172b4d}emptyValueInRead's setting value actually{color}, it 
supposed to be as flows:
{code:scala}
// For reading, convert: "" or emptyValueInRead (csv) => ""(dataframe){code}
For example,  a column has empty strings, if emptyValueInWrite is set to 
"EMPTY" string.
{code:scala}
Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
"EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
and if we read this csv file with emptyValue(emptyValueInRead) set to "EMPTY" 
string.
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
we actually get the DataFrame which data is shown as:
||make||comment||
|tesla|EMPTY|

but the DataFrame which data should be shown as below as  expected:
||make||comment||
|tesla| |

{color:#de350b}*We can not  recovery it to the original DataFrame.*{color}

  was:
Csv data format is 
For null values, the parameter nullValue can be set when reading or writing  in 
CSVOptions:
{code:scala}
// For writing, convert: null(dataframe) => nullValue(csv)

// For reading, convert: nullValue or ,,(csv) => null(dataframe)
{code}
For  example, a column has null values, if nullValue is set to "null" string.
{code:scala}
Seq(("Tesla", null.asInstanceOf[String])).toDF("make", 
"comment").write.option("nullValue", "NULL").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,NULL
{noformat}
and if we read this csv file with nullValue set to "null" string.
{code:java}
spark.read.option("nullValue", "NULL").csv(path).show()
{code}
we can get the DataFrame which data is same with the original shown as:
||make||comment||
|tesla|null|

{color:#57d9a3}*We can succeed to recovery it to the original DataFrame.*{color}

 

Since Spark 2.4, for empty strings, there are  emptyValueInRead for reading and 
emptyValueInWrite for writing that can be set in CSVOptions:
{code:scala}
// For writing, convert: ""(dataframe) => emptyValueInWrite(csv)

// For reading, convert: "" (csv) => emptyValueInRead(dataframe){code}
I think the read handling is not suitable, we can not convert "" or 
`{color:#172b4d}emptyValueInWrite`{color} values as ""(real empty strings) but 
get {color:#172b4d}emptyValueInRead's setting value actually{color}, it 
supposed to be as flows:
{code:scala}
// For reading, convert: "" or emptyValueInRead (csv) => ""(dataframe){code}
For example,  a column has empty strings, if emptyValueInWrite is set to 
"EMPTY" string.
{code:scala}
Seq(("Tesla", "")).toDF("make", "comment").write.option("emptyValue", 
"EMPTY").csv(path){code}
The saved csv file is shown as:
{noformat}
Tesla,EMPTY {noformat}
and if we read this csv file with emptyValue(emptyValueInRead) set to "EMPTY" 
string.
{code:java}
spark.read.option("emptyValue", "EMPTY").csv(path).show()
{code}
we actually get the DataFrame which data is shown as:
||make||comment||
|tesla|EMPTY|

but the DataFrame which data should be shown as below as  expected:
||make||comment||
|tesla| |

{color:#de350b}*We can not  recovery it to the original DataFrame.*{color}


> The parameter emptyValueInRead(in CSVOptions) is suggested to be designed as 
> that any fields matching this string will be set as empty values "" when 
> reading
> 
