Yaohua Zhao created SPARK-46405:
-----------------------------------

             Summary: Issue with CSV schema inference and malformed records
                 Key: SPARK-46405
                 URL: https://issues.apache.org/jira/browse/SPARK-46405
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.5.0
            Reporter: Yaohua Zhao

Schema inference in the CSV reader behaves differently from schema inference in the JSON reader when the input contains malformed records. With JSON, a `_corrupt_record` column is automatically added to the inferred schema when malformed records are present. With CSV read without a predefined schema, no such column is added. This inconsistency can lead to unexpected results and data loss during processing.

*Steps to Reproduce:*
# Create a CSV file that contains malformed records.
# Read the file without providing a schema.
# Observe that the `_corrupt_record` column is not automatically added to the final dataframe.

*Expected Result:*
The `_corrupt_record` column should be automatically added to the final dataframe when processing a CSV file with malformed records, similar to the behavior observed with JSON files.
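
To make the expected result concrete, here is a minimal sketch in plain Python (standard library only). It is not Spark code and not how Spark's CSV reader is implemented. The function name `read_csv_permissive` is hypothetical, and two simplifications are assumptions of this sketch: a row counts as "malformed" when its field count differs from the header's, and a malformed row gets nulls in every data column. The sketch only models the requested semantics: the `_corrupt_record` column appears in the result when, and only when, at least one malformed row was seen.

```python
import csv

CORRUPT_COL = "_corrupt_record"  # column name taken from the issue text


def read_csv_permissive(text):
    """Parse CSV text, keeping malformed rows instead of dropping them.

    Assumption of this sketch: a row is malformed when its number of
    fields differs from the number of header fields.
    Returns (columns, rows), where rows is a list of dicts.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = next(csv.reader([lines[0]]))

    rows = []
    saw_corrupt = False
    for raw in lines[1:]:
        fields = next(csv.reader([raw]))
        if len(fields) == len(header):
            # Well-formed row: map fields to columns, no corrupt payload.
            row = dict(zip(header, fields))
            row[CORRUPT_COL] = None
        else:
            # Malformed row: null out the data columns, keep the raw line.
            row = {name: None for name in header}
            row[CORRUPT_COL] = raw
            saw_corrupt = True
        rows.append(row)

    if saw_corrupt:
        columns = header + [CORRUPT_COL]
    else:
        # No malformed rows: the extra column is not part of the schema.
        columns = list(header)
        for row in rows:
            del row[CORRUPT_COL]
    return columns, rows


sample = "id,name\n1,alice\n2,bob,extra\n3\n"
columns, rows = read_csv_permissive(sample)
print(columns)
for row in rows:
    print(row)
```

In this model the malformed lines `2,bob,extra` and `3` are preserved verbatim in `_corrupt_record` rather than being lost, which is the outcome the issue asks the CSV reader to provide when no schema is supplied.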