Why does the data even need cleaning? That sample is all perfectly valid CSV. The error was setting the quote character (`"`) as the escape character.
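(To see that the sample really is valid CSV as-is, here is a quick cross-check with any standards-compliant parser — a sketch using Python's standard-library csv module, with the thread's data inlined as a string rather than read from a file, which is an assumption for illustration:)

```python
import csv
import io

# The exact sample from the thread. The "," in the last field of row 1 is a
# literal comma inside a quoted field, which is valid CSV and needs no cleaning.
sample = '"a","b","c"\n"1","",","\n"2","","abc"\n'

rows = list(csv.reader(io.StringIO(sample)))
print(rows)
# → [['a', 'b', 'c'], ['1', '', ','], ['2', '', 'abc']]
```

The default dialect treats `"` purely as the quote character, so the header plus two data rows come back intact — matching the `show()` output below.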
On Tue, Jan 3, 2023, 2:32 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> if you take your source CSV as below
>
> "a","b","c"
> "1","",","
> "2","","abc"
>
> and define your code as below
>
> csv_file="hdfs://rhes75:9000/data/stg/test/testcsv.csv"
> # read the csv file in spark
> listing_df = spark.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load(csv_file)
> listing_df.printSchema()
> print(f"""\n Reading from Hive table {csv_file}\n""")
> listing_df.show(100,False)
> listing_df.select("c").show()
>
> results in
>
> Reading from Hive table hdfs://rhes75:9000/data/stg/test/testcsv.csv
>
> +---+----+---+
> |a  |b   |c  |
> +---+----+---+
> |1  |null|,  |
> |2  |null|abc|
> +---+----+---+
>
> +---+
> |  c|
> +---+
> |  ,|
> |abc|
> +---+
>
> which assumes that "," is a value for column c in row 1
>
> This interpretation is correct. You ought to do data cleansing before.
>
> HTH
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
> On Tue, 3 Jan 2023 at 17:03, Sean Owen <sro...@gmail.com> wrote:
>
>> No, you've set the escape character to double-quote, when it looks like you mean for it to be the quote character (which it already is). Remove this setting, as it's incorrect.
>>
>> On Tue, Jan 3, 2023 at 11:00 AM Saurabh Gulati <saurabh.gul...@fedex.com.invalid> wrote:
>>
>>> Hello,
>>> We are seeing a case where the csv reader parses csv data incorrectly.
>>> The issue can be replicated using the below csv data
>>>
>>> "a","b","c"
>>> "1","",","
>>> "2","","abc"
>>>
>>> and using the spark csv read command.
>>>
>>> df = spark.read.format("csv")\
>>>     .option("multiLine", True)\
>>>     .option("escape", '"')\
>>>     .option("enforceSchema", False)\
>>>     .option("header", True)\
>>>     .load(f"/tmp/test.csv")
>>>
>>> df.show(100, False)  # prints both rows
>>> +---+----+---+
>>> |a  |b   |c  |
>>> +---+----+---+
>>> |1  |null|,  |
>>> |2  |null|abc|
>>> +---+----+---+
>>>
>>> df.select("c").show()  # merges last column of first row and first column of second row
>>> +------+
>>> |     c|
>>> +------+
>>> |"\n"2"|
>>> +------+
>>>
>>> print(df.count())  # prints 1, should be 2
>>>
>>> It feels like a bug and I thought of asking the community before creating a bug on jira.
>>>
>>> Mvg/Regards
>>> Saurabh
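(The failure mode Sean describes — reusing the quote character as the escape character — can be reproduced outside Spark. Python's stdlib csv module is not Spark's univocity parser, so this is only an analogy, but setting `escapechar='"'` causes the same kind of collapse: every closing quote is consumed as an escape, fields never terminate, and the row count falls apart, much like `df.count()` returning 1 above:)

```python
import csv
import io

sample = '"a","b","c"\n"1","",","\n"2","","abc"\n'

# Correct parse: default dialect, '"' is the quote character, no escape char.
good = list(csv.reader(io.StringIO(sample)))

# Misconfigured parse: the quote char doubles as the escape char, analogous
# to Spark's .option("escape", '"'). Inside a quoted field the parser now
# treats every '"' as an escape, so fields never close and rows merge.
bad = list(csv.reader(io.StringIO(sample), escapechar='"'))

print(len(good))  # 3 rows: header plus two data rows
print(len(bad))   # collapses to a single row
```

Dropping the `escape` option (or, in Spark, simply not setting it, since `"` is already the default quote character) restores the correct two-row result.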