Martin Rueckl created SPARK-46876:
-------------------------------------

             Summary: Data is silently lost in Tab separated CSV with empty (whitespace) rows
                 Key: SPARK-46876
                 URL: https://issues.apache.org/jira/browse/SPARK-46876
             Project: Spark
          Issue Type: Bug
          Components: Input/Output
    Affects Versions: 3.4.1
            Reporter: Martin Rueckl


When reading a tab-separated file that contains lines consisting only of tabs 
(i.e. every column in that row is an empty string), those rows are silently 
skipped as if they were empty lines, and the resulting dataframe has fewer rows 
than expected.

This behavior is inconsistent with that for e.g. semicolon-separated files, 
where the resulting dataframe contains a row of empty string values.

A minimal reproducible example: a file containing this (\t and \r\n denote 
literal tab and CRLF characters)
{code:java}
a\tb\tc\r\n
\t\t\r\n
1\t2\t3{code}
will create a dataframe with only one row (a=1, b=2, c=3), whereas this
{code:java}
a;b;c\r\n
;;\r\n
1;2;3{code}
will be read as two rows (the first row contains only empty strings).

I used the following PySpark commands to read the dataframes:

{code:python}
spark.read.option("header", "true").option("sep", "\t").csv("<tab separated file>").collect()
spark.read.option("header", "true").option("sep", ";").csv("<semicolon file>").collect()
{code}
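For completeness, a self-contained repro (a sketch assuming a local 
SparkSession; the temp directory and file names are arbitrary) that writes both 
files with literal separators and prints the differing row counts:

{code:python}
import os
import tempfile

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
d = tempfile.mkdtemp()

# Same content for both files; only the separator differs.
for name, sep in [("tab.csv", "\t"), ("semi.csv", ";")]:
    with open(os.path.join(d, name), "w", newline="") as f:
        f.write(sep.join("abc") + "\r\n")  # header: a<sep>b<sep>c
        f.write(sep + sep + "\r\n")        # data row with only empty fields
        f.write(sep.join("123"))           # data row 1,2,3

tab = spark.read.option("header", "true").option("sep", "\t").csv(os.path.join(d, "tab.csv"))
semi = spark.read.option("header", "true").option("sep", ";").csv(os.path.join(d, "semi.csv"))
print(tab.count())   # 1 -- the separator-only row is silently dropped
print(semi.count())  # 2 -- the separator-only row survives as empty values
{code}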
 

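As a stopgap, one workaround (a sketch, not a vetted solution) is to read the 
file as plain text and split it manually; a line consisting only of tabs is not 
empty, so the text reader keeps it:

{code:python}
from pyspark.sql import functions as F

# Read raw lines; separator-only lines are non-empty and therefore preserved.
lines = spark.read.text("<tab separated file>")
parts = lines.select(F.split("value", "\t").alias("p"))

# Project the array into named columns. The header row, if any, must be
# dropped separately, e.g. by filtering out the known header values.
df = parts.select(*[F.col("p")[i].alias(c) for i, c in enumerate(["a", "b", "c"])])
{code}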
I ran into this on Databricks in particular (I assume they use the same 
reader), but [this Stack Overflow 
post|https://stackoverflow.com/questions/47823858/replacing-empty-lines-with-characters-when-reading-csv-using-spark#comment137288546_47823858]
 indicates that this is an old issue that may have been carried over when the 
Databricks CSV reader was adopted in this PR: 

I recommend at least adding a corresponding test case to the CSV reader.
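A sketch of what such a test could look like, expressed in PySpark (the real 
test would presumably belong in the Scala CSVSuite; the `spark` fixture and 
file layout here are hypothetical):

{code:python}
def test_separator_only_rows_are_not_dropped(spark, tmp_path):
    # A row consisting only of separators must come back as a row of
    # empty/null values, not be skipped as a blank line.
    path = tmp_path / "data.tsv"
    with open(path, "w", newline="") as f:
        f.write("a\tb\tc\r\n")
        f.write("\t\t\r\n")
        f.write("1\t2\t3")
    df = spark.read.option("header", "true").option("sep", "\t").csv(str(path))
    assert df.count() == 2  # currently fails: the separator-only row is lost
{code}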


