Hi Femi,

Have you maybe tried the quote-related options specified in the documentation?
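A minimal sketch of what I mean, not tested against your file. Your sample row escapes a quote inside a quoted field by doubling it (""), while Spark's CSV reader uses backslash as its default escape character. Passing escape='"' together with quote='"' is the usual way to tell the reader about the doubled-quote convention. The helper names and the shortened sample line below are made up for illustration; the stdlib csv check is only there to show how the fields are expected to split.

```python
import csv

# Shortened excerpt of the problematic row (illustrative, not the full record).
SAMPLE_LINE = '"32 YIU ""A""",S5,,"32 XIY ""W"" JK, RE LK",23.0'


def parse_row(line):
    """Split one CSV line with the stdlib parser, which reads "" as an escaped quote."""
    return next(csv.reader([line], quotechar='"', doublequote=True))


def read_csv_with_quote_options(spark, path):
    """Hypothetical helper: read `path` with Spark's quote/escape options set.

    Assumes `spark` is an existing SparkSession. escape='"' asks the reader to
    treat a doubled quote inside a quoted field as a literal quote character.
    """
    return spark.read.csv(path, header=True, quote='"', escape='"')


if __name__ == "__main__":
    # The embedded comma and the doubled quotes stay inside a single field.
    print(parse_row(SAMPLE_LINE))
```

With a running session, something like sdf = read_csv_with_quote_options(spark, "malformed_data.csv") followed by sdf.select("col12", "col13", "col14").show() should show whether col12 now keeps its embedded comma.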
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv

Thanks.

2016-11-06 6:58 GMT+09:00 Femi Anthony <femib...@gmail.com>:

> Hi, I am trying to process a very large comma delimited csv file and I am
> running into problems.
> The main problem is that some fields contain quoted strings with embedded
> commas.
> It seems as if PySpark is unable to properly parse lines containing such
> fields like say Pandas does.
>
> Here is the code I am using to read the file in Pyspark
>
> df_raw=spark.read.option("header","true").csv(csv_path)
>
> Here is an example of a good and 'bad' line in such a file:
>
> col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19
> 80015360210876000,11.22,X,4076710258,,,sxsw,,"32 YIU ""A""",S5,,"32 XIY ""W"" JK, RE LK",SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE,23.0,cyclingstats,2012-25-19,432,2023-05-17,CODERED
> 61670000229561918,137.12,U,8234971771,,,woodstock,,,T4,,,OUTKAST#THROOTS~WUTANG#RUNDMC,0.0,runstats,2013-21-22,1333,2019-11-23,CODEBLUE
>
> Line 0 is the header
> Line 1 is the 'problematic' line
> Line 2 is a good line.
>
> Pandas can handle this easily:
>
> In [1]: import pandas as pd
>
> In [2]: pdf = pd.read_csv('malformed_data.csv')
>
> In [4]: pdf[['col12','col13','col14']]
> Out[4]:
>                   col12                                             col13  \
> 0  32 XIY "W" JK, RE LK  SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE
> 1                   NaN                     OUTKAST#THROOTS~WUTANG#RUNDMC
>
>    col14
> 0   23.0
> 1    0.0
>
> while Pyspark seems to parse this erroneously:
>
> In [5]: sdf=spark.read.format("org.apache.spark.csv").csv('malformed_data.csv',header=True)
>
> In [6]: sdf.select("col12","col13",'col14').show()
> +------------------+--------------------+--------------------+
> |             col12|               col13|               col14|
> +------------------+--------------------+--------------------+
> |  "32 XIY ""W"" JK|              RE LK"|SOMETHINGLIKEAPHE...|
> |              null|OUTKAST#THROOTS~W...|                 0.0|
> +------------------+--------------------+--------------------+
>
> Is this a bug or am I doing something wrong ?
> I am working with Spark 2.0
> Any help is appreciated
>
> Thanks,
> -- Femi
>
> http://www.nextmatrix.com
> "Great spirits have always encountered violent opposition from mediocre
> minds." - Albert Einstein.
>