Re: Reading csv files with quoted fields containing embedded commas

2016-11-06 Thread Femi Anthony
The quote options seem to be related to escaping quotes, and the dataset
isn't escaping quotes with a backslash (it doubles them, RFC 4180 style). As
I said, quoted strings with embedded commas are something that pandas
handles easily, and even Excel handles them as well.
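
That said, one of those options might still be worth trying (a sketch,
untested against the full file): Spark's CSV parser defaults its escape
character to backslash, so telling it that the escape character is the quote
character itself should make the doubled quotes parse as literal quotes:

df_raw = (spark.read
          .option("header", "true")
          .option("quote", '"')    # the default quote character, shown for clarity
          .option("escape", '"')   # treat "" inside a quoted field as a literal "
          .csv(csv_path))

With that in place, the embedded comma should stay inside col12 instead of
splitting the field.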


Femi

On Sun, Nov 6, 2016 at 6:59 AM, Hyukjin Kwon wrote:

> Hi Femi,
>
> Have you maybe tried the quote related options specified in the
> documentation?
>
> http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv
>
> Thanks.
>
> 2016-11-06 6:58 GMT+09:00 Femi Anthony:
>
>> Hi, I am trying to process a very large comma-delimited CSV file and I am
>> running into problems. The main problem is that some fields contain quoted
>> strings with embedded commas. It seems as if PySpark is unable to properly
>> parse lines containing such fields the way that, say, pandas does.
>>
>> Here is the code I am using to read the file in PySpark:
>>
>> df_raw = spark.read.option("header", "true").csv(csv_path)
>>
>> Here is an example of a good and 'bad' line in such a file:
>>
>>
>> col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19
>> 80015360210876000,11.22,X,4076710258,,,sxsw,,"32 YIU ""A""",S5,,"32 XIY ""W""   JK, RE LK",SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE,23.0,cyclingstats,2012-25-19,432,2023-05-17,CODERED
>> 6167229561918,137.12,U,8234971771,,,woodstock,,,T4,,,OUTKAST#THROOTS~WUTANG#RUNDMC,0.0,runstats,2013-21-22,1333,2019-11-23,CODEBLUE
>>
>> Line 0 is the header
>> Line 1 is the 'problematic' line
>> Line 2 is a good line.
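>>
>> As a quick sanity check that the 'problematic' line itself is well-formed,
>> Python's stdlib csv module, whose defaults treat a doubled quote inside a
>> quoted field as a literal quote (RFC 4180 style), parses it cleanly. A
>> small sketch, re-parsing the line shown above:
>>
>> import csv, io
>>
>> # the 'problematic' line, verbatim
>> line = ('80015360210876000,11.22,X,4076710258,,,sxsw,,"32 YIU ""A""",S5,,'
>>         '"32 XIY ""W""   JK, RE LK",'
>>         'SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE,'
>>         '23.0,cyclingstats,2012-25-19,432,2023-05-17,CODERED')
>>
>> row = next(csv.reader(io.StringIO(line)))
>> print(len(row))   # 19 fields, matching the header
>> print(row[11])    # 32 XIY "W"   JK, RE LK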
>>
>> Pandas can handle this easily:
>>
>>
>> In [1]: import pandas as pd
>>
>> In [2]: pdf = pd.read_csv('malformed_data.csv')
>>
>> In [4]: pdf[['col12','col13','col14']]
>> Out[4]:
>>                      col12                                              col13  col14
>> 0   32 XIY "W"   JK, RE LK  SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE   23.0
>> 1                      NaN                      OUTKAST#THROOTS~WUTANG#RUNDMC    0.0
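>>
>> (For reference, pandas succeeds here because its own defaults already
>> match this dialect: quotechar='"' and doublequote=True. The call above is
>> equivalent to the sketch below, so no extra options are needed on the
>> pandas side:)
>>
>> pdf = pd.read_csv('malformed_data.csv', quotechar='"', doublequote=True)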
>>
>>
>> while PySpark seems to parse this erroneously:
>>
>> In [5]: sdf = spark.read.format("org.apache.spark.csv").csv('malformed_data.csv', header=True)
>>
>> In [6]: sdf.select("col12","col13",'col14').show()
>> +------------------+--------------------+--------------------+
>> |             col12|               col13|               col14|
>> +------------------+--------------------+--------------------+
>> |"32 XIY ""W""   JK|              RE LK"|SOMETHINGLIKEAPHE...|
>> |              null|OUTKAST#THROOTS~W...|                 0.0|
>> +------------------+--------------------+--------------------+
>>
>> Is this a bug, or am I doing something wrong? I am working with Spark 2.0.
>> Any help is appreciated.
>>
>> Thanks,
>> -- Femi
>>
>> http://www.nextmatrix.com
>> "Great spirits have always encountered violent opposition from mediocre
>> minds." - Albert Einstein.
>>
>
>


-- 
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.

