Why does the data even need cleaning? That sample is all perfectly valid CSV. The error was setting the quote character (`"`) as the escape character.
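(To see that the sample really is valid CSV as-is, here is a quick cross-check with any standards-compliant parser — a sketch using Python's standard-library csv module, with the thread's data inlined as a string rather than read from a file, which is an assumption for illustration:)

```python
import csv
import io

# The exact sample from the thread. The "," in the last field of row 1 is a
# literal comma inside a quoted field, which is valid CSV and needs no cleaning.
sample = '"a","b","c"\n"1","",","\n"2","","abc"\n'

rows = list(csv.reader(io.StringIO(sample)))
print(rows)
# → [['a', 'b', 'c'], ['1', '', ','], ['2', '', 'abc']]
```

The default dialect treats `"` purely as the quote character, so the header plus two data rows come back intact — matching the `show()` output below.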
On Tue, Jan 3, 2023, 2:32 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> if you take your source CSV as below
>
> "a","b","c"
> "1","",","
> "2","","abc"
>
> and define your code as below
>
> csv_file="hdfs://rhes75:9000/data/stg/test/testcsv.csv"
> # read the csv file in spark
> listing_df = spark.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load(csv_file)
> listing_df.printSchema()
> print(f"""\n Reading from Hive table {csv_file}\n""")
> listing_df.show(100,False)
> listing_df.select("c").show()
>
> results in
>
> Reading from Hive table hdfs://rhes75:9000/data/stg/test/testcsv.csv
>
> +---+----+---+
> |a  |b   |c  |
> +---+----+---+
> |1  |null|,  |
> |2  |null|abc|
> +---+----+---+
>
> +---+
> |  c|
> +---+
> |  ,|
> |abc|
> +---+
>
> which assumes that "," is a value for column c in row 1
>
> This interpretation is correct. You ought to do data cleansing before.
>
> HTH
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
> On Tue, 3 Jan 2023 at 17:03, Sean Owen <sro...@gmail.com> wrote:
>
>> No, you've set the escape character to double-quote, when it looks like you mean for it to be the quote character (which it already is). Remove this setting, as it's incorrect.
>>
>> On Tue, Jan 3, 2023 at 11:00 AM Saurabh Gulati <saurabh.gul...@fedex.com.invalid> wrote:
>>
>>> Hello,
>>> We are seeing a case where the csv reader parses csv data incorrectly.
>>> The issue can be replicated using the below csv data
>>>
>>> "a","b","c"
>>> "1","",","
>>> "2","","abc"
>>>
>>> and using the spark csv read command.
>>>
>>> df = spark.read.format("csv")\
>>>     .option("multiLine", True)\
>>>     .option("escape", '"')\
>>>     .option("enforceSchema", False)\
>>>     .option("header", True)\
>>>     .load(f"/tmp/test.csv")
>>>
>>> df.show(100, False)  # prints both rows
>>> +---+----+---+
>>> |a  |b   |c  |
>>> +---+----+---+
>>> |1  |null|,  |
>>> |2  |null|abc|
>>> +---+----+---+
>>>
>>> df.select("c").show()  # merges last column of first row and first column of second row
>>> +------+
>>> |     c|
>>> +------+
>>> |"\n"2"|
>>> +------+
>>>
>>> print(df.count())  # prints 1, should be 2
>>>
>>> It feels like a bug and I thought of asking the community before creating a bug on jira.
>>>
>>> Mvg/Regards
>>> Saurabh
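(The failure mode Sean describes — reusing the quote character as the escape character — can be reproduced outside Spark. Python's stdlib csv module is not Spark's univocity parser, so this is only an analogy, but setting `escapechar='"'` causes the same kind of collapse: every closing quote is consumed as an escape, fields never terminate, and the row count falls apart, much like `df.count()` returning 1 above:)

```python
import csv
import io

sample = '"a","b","c"\n"1","",","\n"2","","abc"\n'

# Correct parse: default dialect, '"' is the quote character, no escape char.
good = list(csv.reader(io.StringIO(sample)))

# Misconfigured parse: the quote char doubles as the escape char, analogous
# to Spark's .option("escape", '"'). Inside a quoted field the parser now
# treats every '"' as an escape, so fields never close and rows merge.
bad = list(csv.reader(io.StringIO(sample), escapechar='"'))

print(len(good))  # 3 rows: header plus two data rows
print(len(bad))   # collapses to a single row
```

Dropping the `escape` option (or, in Spark, simply not setting it, since `"` is already the default quote character) restores the correct two-row result.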