Yaohua Zhao created SPARK-46405:
-----------------------------------

             Summary:  Issue with CSV schema inference and malformed records
                 Key: SPARK-46405
                 URL: https://issues.apache.org/jira/browse/SPARK-46405
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.5.0
            Reporter: Yaohua Zhao


Schema inference in the CSV reader appears to handle malformed records 
differently from the JSON reader. When a JSON source containing malformed 
records is read without a user-specified schema, a `_corrupt_record` column is 
automatically added to the inferred schema. The CSV reader does not do this 
under the same conditions. This inconsistency can lead to unexpected results 
and data loss during processing.

*Steps to Reproduce:*
 # Read a CSV file that contains malformed records without providing a schema.
 # Observe that the `_corrupt_record` column is not automatically added to the 
resulting DataFrame.

*Expected Result:* The `_corrupt_record` column should be automatically added 
to the final dataframe when processing a CSV file with malformed records, 
similar to the behavior observed with JSON files.
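
The comparison described above could be reproduced with a sketch along these 
lines. It is not part of the original report: the file names, sample rows, and 
the helper names `write_sample_files` and `inferred_columns` are illustrative 
assumptions, and the second helper expects an existing PySpark `SparkSession`.

```python
import json
import os
import tempfile


def write_sample_files(directory):
    """Write a CSV file and a JSON Lines file, each with one malformed record.

    File names and row contents are hypothetical sample data.
    """
    csv_path = os.path.join(directory, "people.csv")
    json_path = os.path.join(directory, "people.json")

    with open(csv_path, "w", newline="") as f:
        f.write("id,name\n")
        f.write("1,alice\n")
        # Malformed: three tokens under a two-column header.
        f.write("2,bob,unexpected\n")

    with open(json_path, "w") as f:
        f.write(json.dumps({"id": 1, "name": "alice"}) + "\n")
        # Malformed: truncated JSON object.
        f.write('{"id": 2, "name": \n')

    return csv_path, json_path


def inferred_columns(spark, csv_path, json_path):
    """Return the column names inferred for each file when no schema is given.

    `spark` is assumed to be an active pyspark.sql.SparkSession.
    """
    csv_df = (
        spark.read
        .option("header", True)
        .option("inferSchema", True)
        .csv(csv_path)
    )
    json_df = spark.read.json(json_path)
    return csv_df.columns, json_df.columns
```

In a PySpark session, calling 
`inferred_columns(spark, *write_sample_files(tempfile.mkdtemp()))` returns the 
two column lists side by side. Per this report, the JSON list is expected to 
include `_corrupt_record`, while the CSV list is not.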


