Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the data

2023-01-05 Thread Saurabh Gulati
And two single quotes together ('') look like a single double quote (").

Mvg/Regards
Saurabh Gulati

From: Saurabh Gulati 
Sent: 05 January 2023 12:24
To: Sean Owen 
Cc: User 
Subject: Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within 
the data

It's the same input, except that the headers are also being read by the csv reader.

Mvg/Regards
Saurabh Gulati

From: Sean Owen 
Sent: 04 January 2023 15:12
To: Saurabh Gulati 
Cc: User 
Subject: Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within 
the data

That does not appear to be the same input you used in your example. What are the 
contents of test.csv?

On Wed, Jan 4, 2023 at 7:45 AM Saurabh Gulati wrote:
Hi @Sean Owen
Probably the data is incorrect, and the source needs to fix it.
But using Python's csv parser returns the correct results.

import csv

with open("/tmp/test.csv") as c_file:
    csv_reader = csv.reader(c_file, delimiter=",")
    for row in csv_reader:
        print(row)

['a', 'b', 'c']
['1', '', ',see what "I did",\ni am still writing']
['2', '', 'abc']
And also, I don't understand why there is a distinction in outputs from 
df.show() and df.select("c").show()

Mvg/Regards
Saurabh Gulati
Data Platform

From: Sean Owen
Sent: 04 January 2023 14:25
To: Saurabh Gulati
Cc: Mich Talebzadeh; User
Subject: Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within 
the data

That input is just invalid as CSV for any parser. You end a quoted col without 
following with a col separator. What would the intended parsing be and how 
would it work?

On Wed, Jan 4, 2023 at 4:30 AM Saurabh Gulati wrote:

@Sean Owen Also see the example below, with feedback that contains quotes:
"a","b","c"
"1","",",see what ""I did"","
"2","","abc"


Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the data

2023-01-04 Thread Sean Owen
That does not appear to be the same input you used in your example. What are
the contents of test.csv?

On Wed, Jan 4, 2023 at 7:45 AM Saurabh Gulati 
wrote:

> Hi @Sean Owen 
> Probably the data is incorrect, and the source needs to fix it.
> But using python's csv parser returns the correct results.
>
> import csv
>
> with open("/tmp/test.csv") as c_file:
>     csv_reader = csv.reader(c_file, delimiter=",")
>     for row in csv_reader:
>         print(row)
>
> ['a', 'b', 'c']
> ['1', '', ',see what "I did",\ni am still writing']
> ['2', '', 'abc']
>
> And also, I don't understand why there is a distinction in outputs from
> df.show() and df.select("c").show()
>
> Mvg/Regards
> Saurabh Gulati
> Data Platform
> --
> *From:* Sean Owen 
> *Sent:* 04 January 2023 14:25
> *To:* Saurabh Gulati 
> *Cc:* Mich Talebzadeh ; User <
> user@spark.apache.org>
> *Subject:* Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used
> within the data
>
> That input is just invalid as CSV for any parser. You end a quoted col
> without following with a col separator. What would the intended parsing be
> and how would it work?
>
> On Wed, Jan 4, 2023 at 4:30 AM Saurabh Gulati 
> wrote:
>
>
> @Sean Owen  Also see the example below with quotes
> feedback:
>
> "a","b","c"
> "1","",",see what ""I did"","
> "2","","abc"
>
>


Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the data

2023-01-04 Thread Saurabh Gulati
Hi @Sean Owen
Probably the data is incorrect, and the source needs to fix it.
But using Python's csv parser returns the correct results.

import csv

with open("/tmp/test.csv") as c_file:
    csv_reader = csv.reader(c_file, delimiter=",")
    for row in csv_reader:
        print(row)

['a', 'b', 'c']
['1', '', ',see what "I did",\ni am still writing']
['2', '', 'abc']
And also, I don't understand why there is a distinction in outputs from 
df.show() and df.select("c").show()
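
For reference, the Python result above can be reproduced without a file on disk. This is only a minimal sketch: it assumes test.csv holds the multi-line sample from the original report in this thread, and SAMPLE and parse_rows are illustrative names that do not appear in the original code.

```python
import csv
import io

# Assumed contents of /tmp/test.csv (the multi-line sample from the thread).
SAMPLE = (
    '"a","b","c"\n'
    '"1","",",see what ""I did"",\n'
    'i am still writing"\n'
    '"2","","abc"\n'
)


def parse_rows(text):
    """Parse CSV text with Python's csv module using its default dialect."""
    # newline="" leaves line endings untouched so the csv module can
    # handle line breaks inside quoted fields itself.
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=","))


rows = parse_rows(SAMPLE)
for row in rows:
    print(row)
```

With this input the csv module treats "" inside a quoted field as one literal quote and keeps the embedded line break as part of column c.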

Mvg/Regards
Saurabh Gulati
Data Platform

From: Sean Owen 
Sent: 04 January 2023 14:25
To: Saurabh Gulati 
Cc: Mich Talebzadeh ; User 
Subject: Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within 
the data

That input is just invalid as CSV for any parser. You end a quoted col without 
following with a col separator. What would the intended parsing be and how 
would it work?

On Wed, Jan 4, 2023 at 4:30 AM Saurabh Gulati wrote:

@Sean Owen Also see the example below, with feedback that contains quotes:
"a","b","c"
"1","",",see what ""I did"","
"2","","abc"


Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the data

2023-01-04 Thread Sean Owen
That input is just invalid as CSV for any parser. You end a quoted col
without following with a col separator. What would the intended parsing be
and how would it work?

On Wed, Jan 4, 2023 at 4:30 AM Saurabh Gulati 
wrote:

>
> @Sean Owen  Also see the example below with quotes
> feedback:
>
> "a","b","c"
> "1","",",see what ""I did"","
> "2","","abc"
>
>


Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the data

2023-01-04 Thread Saurabh Gulati
Hey guys, I much appreciate your quick responses.

To answer your questions,
@Mich Talebzadeh We get data from multiple sources, and we don't have any 
control over what they put in. In this case the column is supposed to contain 
some feedback, and it can also contain quoted strings.

@Sean Owen Also see the example below, with feedback that contains quotes:
"a","b","c"
"1","",",see what ""I did"","
"2","","abc"
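
The sample above uses the doubled-quote convention, where a literal " inside a quoted field is written as "". As a rough check, the sketch below writes the field values with Python's csv module and reads them back. It assumes the intended values are those listed in ROWS; ROWS, to_csv and from_csv are hypothetical names used only for illustration.

```python
import csv
import io

# Field values as the source presumably intended them (an assumption).
ROWS = [
    ["a", "b", "c"],
    ["1", "", ',see what "I did",'],
    ["2", "", "abc"],
]


def to_csv(rows):
    """Serialize rows with every field quoted; inner quotes get doubled."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


def from_csv(text):
    """Parse CSV text back into a list of rows."""
    return list(csv.reader(io.StringIO(text, newline="")))


encoded = to_csv(ROWS)
print(encoded)
```

The encoded text matches the three sample lines, and parsing it returns the original values, so a reader that treats " as the escape character should recover column c intact.
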
Here, if we don't set " as the escape character:

df = (spark.read.option("multiLine", True)
      .option("enforceSchema", False)
      .option("header", True)
      .csv("/tmp/test.csv"))

df.show(100, False)

+---+----+--------------------+
|a  |b   |c                   |
+---+----+--------------------+
|1  |null|",see what ""I did""|
+---+----+--------------------+

df.count()

1

So we set " as the escape character, and then it is parsed fine, but the count 
is wrong.


df = (spark.read.option("escape", '"')
      .option("multiLine", True)
      .option("enforceSchema", False)
      .option("header", True)
      .csv("/tmp/test.csv"))

df.show(100, False)

+---+----+------------------+
|a  |b   |c                 |
+---+----+------------------+
|1  |null|,see what "I did",|
|2  |null|abc               |
+---+----+------------------+

df.count()
1

I understand it's a complex case, or maybe an edge case, which makes it difficult 
for Spark
to understand where a column ends, even though we have enabled multiLine=True.

See another example below, which even has a multi-line value for column c.

"a","b","c"
"1","",",see what ""I did"",
i am still writing"
"2","","abc"

# with escape

df = (spark.read.option("escape", '"')
      .option("multiLine", True)
      .option("enforceSchema", False)
      .option("header", True)
      .csv("/tmp/test.csv"))

df.show(10, False)

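+---+----+--------------------------------------+
|a  |b   |c                                     |
+---+----+--------------------------------------+
|1  |null|,see what "I did",\ni am still writing|
|2  |null|abc                                   |
+---+----+--------------------------------------+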

df.count()
1

df.select("c").show(10, False)
+------------------+
|c                 |
+------------------+
|see what ""I did""|
|null              |
|abc               |
+------------------+

# without escape "


df.show(10, False)

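+-------------------+----+--------------------+
|a                  |b   |c                   |
+-------------------+----+--------------------+
|1                  |null|",see what ""I did""|
|i am still writing"|null|null                |
|2                  |null|abc                 |
+-------------------+----+--------------------+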

df.select("c").show(10, False)

+--------------------+
|c                   |
+--------------------+
|",see what ""I did""|
|null                |
|abc                 |
+--------------------+


The issue is that Spark prints the complete data frame correctly with escape 
enabled,
but when you select a column or ask for a count, it gives the wrong output.
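
One way to sanity-check these numbers is to compute the expected record count independently of Spark. The sketch below does that with Python's csv module, assuming the multi-line sample above is the file content; count_records is a hypothetical helper, not something from Spark or from the original code.

```python
import csv
import io

# Assumed file contents: the multi-line sample shown above.
SAMPLE = (
    '"a","b","c"\n'
    '"1","",",see what ""I did"",\n'
    'i am still writing"\n'
    '"2","","abc"\n'
)


def count_records(text, header=True):
    """Count logical CSV records; quoted line breaks stay inside a field."""
    reader = csv.reader(io.StringIO(text, newline=""))
    total = sum(1 for _ in reader)
    # Exclude the header record from the count when one is present.
    return total - 1 if header and total > 0 else total


expected = count_records(SAMPLE)
physical_lines = len(SAMPLE.splitlines())
print(expected, physical_lines)
```

For this sample the sketch reports 2 data records spread over 4 physical lines, so neither the df.count() result of 1 nor the three rows returned by df.select("c") matches what a quote-aware parser would expect.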


Regards
Saurabh

From: Mich Talebzadeh 
Sent: 04 January 2023 10:14
To: Sean Owen 
Cc: Saurabh Gulati ; User 
Subject: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the 
data

What is the point of having  , as a column value? From a business point of view 
it does not signify anything IMO




 




On Tue, 3 Jan 2023 at 20:39, Sean Owen 
mailto:sro...@g