The quote options seem to be related to escaping quotes, and the dataset
isn't escaping quotes. As I said, quoted strings with embedded commas are
something that pandas handles easily, and even Excel handles them as well.
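For what it's worth, the sample rows further down the thread escape literal quotes by doubling them (`""` for `"`), which is the RFC 4180 convention. Python's standard csv module accepts that out of the box, which is presumably why pandas parses the file cleanly — a quick sketch using the "problematic" row from the original message:

```python
import csv
import io

# The "problematic" line from the thread: an embedded comma inside a quoted
# field, and literal quotes escaped by doubling ("" -> "), per RFC 4180.
line = ('80015360210876000,11.22,X,4076710258,,,sxsw,,"32 YIU ""A""",S5,,'
        '"32 XIY ""W"" JK, RE LK",'
        'SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE,'
        '23.0,cyclingstats,2012-25-19,432,2023-05-17,CODERED')

# csv.reader handles doubled quotes by default (doublequote=True),
# so the comma inside col12 does not split the field.
row = next(csv.reader(io.StringIO(line)))
print(len(row))  # 19 fields, matching the 19-column header
print(row[11])   # col12: 32 XIY "W" JK, RE LK
```

Spark's CSV reader, by contrast, defaults its escape character to backslash rather than a doubled quote, which would explain the misparse — see the option sketch in the reply below, but verify against your Spark version.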
Femi
On Sun, Nov 6, 2016 at 6:59 AM, Hyukjin Kwon wrote:
> Hi Femi,
>
> Have you maybe tried the quote related options specified in the
> documentation?
>
> http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv
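For concreteness, a minimal sketch of what those quote-related options might look like, assuming the Spark 2.x `DataFrameReader.csv` interface linked above (option names taken from that page; the `escape` default is backslash, which is why `""`-style quoting is misread):

```python
# Hypothetical reader configuration (assumes a SparkSession named `spark`
# is in scope; option names are from the DataFrameReader.csv docs above).
csv_options = {
    "header": "true",  # first line is the header row
    "quote": '"',      # field-quoting character (this is the default)
    "escape": '"',     # treat "" inside a quoted field as a literal quote
}

# With a live SparkSession this would read the file:
#   df = spark.read.options(**csv_options).csv("malformed_data.csv")
print(csv_options["escape"])  # "
```

Setting `escape` to the quote character itself is the usual workaround for RFC 4180-style doubled quotes, but it is worth confirming against the docs for your exact Spark version.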
>
> Thanks.
>
> 2016-11-06 6:58 GMT+09:00 Femi Anthony:
>
>> Hi, I am trying to process a very large comma-delimited CSV file and I am
>> running into problems.
>> The main problem is that some fields contain quoted strings with embedded
>> commas.
>> It seems as if PySpark is unable to properly parse lines containing such
>> fields the way that, say, pandas does.
>>
>> Here is the code I am using to read the file in PySpark:
>>
>> df_raw = spark.read.option("header", "true").csv(csv_path)
>>
>> Here is an example of a good and 'bad' line in such a file:
>>
>>
>> col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19
>> 80015360210876000,11.22,X,4076710258,,,sxsw,,"32 YIU ""A""",S5,,"32 XIY ""W"" JK, RE LK",SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE,23.0,cyclingstats,2012-25-19,432,2023-05-17,CODERED
>> 6167229561918,137.12,U,8234971771,,,woodstock,,,T4,,,OUTKAST#THROOTS~WUTANG#RUNDMC,0.0,runstats,2013-21-22,1333,2019-11-23,CODEBLUE
>>
>> Line 0 is the header
>> Line 1 is the 'problematic' line
>> Line 2 is a good line.
>>
>> Pandas can handle this easily:
>>
>>
>> In [1]: import pandas as pd
>>
>> In [2]: pdf = pd.read_csv('malformed_data.csv')
>>
>> In [4]: pdf[['col12','col13','col14']]
>> Out[4]:
>>                   col12                                             col13  col14
>> 0  32 XIY "W" JK, RE LK  SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE   23.0
>> 1                   NaN                     OUTKAST#THROOTS~WUTANG#RUNDMC    0.0
>>
>>
>> while PySpark seems to parse this erroneously:
>>
>> In [5]: sdf = spark.read.format("org.apache.spark.csv").csv('malformed_data.csv', header=True)
>>
>> In [6]: sdf.select("col12", "col13", 'col14').show()
>> +----------------+--------------------+--------------------+
>> |           col12|               col13|               col14|
>> +----------------+--------------------+--------------------+
>> |"32 XIY ""W"" JK|              RE LK"|SOMETHINGLIKEAPHE...|
>> |            null|OUTKAST#THROOTS~W...|                 0.0|
>> +----------------+--------------------+--------------------+
>>
>> Is this a bug, or am I doing something wrong?
>> I am working with Spark 2.0.
>> Any help is appreciated.
>>
>> Thanks,
>> -- Femi
>>
>> http://www.nextmatrix.com
>> "Great spirits have always encountered violent opposition from mediocre
>> minds." - Albert Einstein.
>>
>
>
--
http://www.femibyte.com/twiki5/bin/view/Tech/