[ https://issues.apache.org/jira/browse/SPARK-40584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tarique Anwer updated SPARK-40584:
----------------------------------
    Priority: Major  (was: Minor)

> Incorrect Count when reading CSV file
> -------------------------------------
>
>                 Key: SPARK-40584
>                 URL: https://issues.apache.org/jira/browse/SPARK-40584
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.1.2
>            Reporter: Tarique Anwer
>            Priority: Major
>
> I'm trying to read the below data from a CSV file and end up with a wrong
> count, although the dataframe contains all the records below.
> df_inputfile.count() prints 3 although it should have been 4.
> {code:java}
> B1123451020-502,"","{""m"": {""difference"": 60}}","","","",2022-02-12T15:40:00.783Z
> B1456741975-266,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z
> B1789753479-460,"","",",","","",2022-02-18T14:46:57.332Z
> B1456741977-123,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z
> {code}
> Here's the code:
> {code:java}
> df_inputfile = (spark.read.format("com.databricks.spark.csv")
>                 .option("inferSchema", "true")
>                 .option("header","false")
>                 .option("quotedstring",'\"')
>                 .option("escape",'\"')
>                 .option("multiline","true")
>                 .option("delimiter",",")
>                 .load('<path to csv>'))
> print(df_inputfile.count())  # Prints 3
> print(df_inputfile.distinct().count())  # Prints 4
> {code}
> Adding a cache() statement before the count results in correct output.
> Removing the option 'escape' also results in a correct count.
> {noformat}
> option("escape",'\"'){noformat}
> It looks like this is happening because of the single comma in the 4th column
> of the 3rd row. Can someone please explain what's going on?
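
As a sanity check independent of Spark, the sample rows can be parsed with Python's standard csv module, which treats a doubled quote inside a quoted field as an escaped quote. This is only a sketch: it does not reproduce Spark's CSV parser and does not explain why count() returns 3. The names SAMPLE and parse_rows are illustrative and not part of the report, and the sketch assumes each record sits on a single line in the source file.

```python
import csv
import io

# The four records from the report, assumed to be one record per line.
SAMPLE = '''B1123451020-502,"","{""m"": {""difference"": 60}}","","","",2022-02-12T15:40:00.783Z
B1456741975-266,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z
B1789753479-460,"","",",","","",2022-02-18T14:46:57.332Z
B1456741977-123,"","{""m"": {""difference"": 60}}","","","",2022-02-04T17:03:59.566Z
'''


def parse_rows(text):
    """Parse CSV text where a quote inside a quoted field is written as two quotes."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=",",
        quotechar='"',
        doublequote=True,  # "" inside a quoted field means a literal "
    )
    return list(reader)


rows = parse_rows(SAMPLE)
print(len(rows))             # number of records found
print({len(r) for r in rows})  # distinct field counts per record
print(repr(rows[2][3]))      # 4th column of the 3rd row, the lone comma
```

Under these assumptions the reference parse yields four records of seven fields each, with the 4th column of the 3rd row holding a single comma, which matches the reporter's expectation that the count should be 4.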