Re: Handling windows characters with Spark CSV on Linux

2016-11-17 Thread Hyukjin Kwon
Actually, the CSV datasource supports an encoding option[1] (although it does not support non-ASCII-compatible encoding types). [1] https://github.com/apache/spark/blob/44c8bfda793b7655e2bd1da5e9915a09ed9d42ce/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L364 On 17 Nov 2016 10:59
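
A minimal sketch of using that option from the Scala API (the HDFS path, the windows-1252 code page and the header setting are illustrative assumptions, not taken from the thread):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("csv-encoding-example")
      .getOrCreate()

    // Read a CSV produced on Windows with a non-UTF-8 code page by telling
    // the CSV datasource which charset to decode with.
    val df = spark.read
      .option("header", "true")
      .option("encoding", "windows-1252")   // "charset" is accepted as an alias
      .csv("hdfs:///staging/input.csv")

    df.show(5)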

Re: Handling windows characters with Spark CSV on Linux

2016-11-17 Thread Mich Talebzadeh
Thanks Ayan. That only works for extra characters such as ^ characters etc. Unfortunately it does not cure specific character sets. Cheers, Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Handling windows characters with Spark CSV on Linux

2016-11-17 Thread ayan guha
There is a utility called dos2unix. You can give it a try. On 18 Nov 2016 00:20, "Jörn Franke" wrote: > > You can do the conversion of the character set (is this the issue?) as part of your loading process in Spark. > As far as I know the Spark CSV package is based on Hadoop

Re: Handling windows characters with Spark CSV on Linux

2016-11-17 Thread Jörn Franke
You can do the conversion of the character set (is this the issue?) as part of your loading process in Spark. As far as I know, the Spark CSV package is based on the Hadoop TextFileInputFormat. This format, to the best of my knowledge, supports only UTF-8, so you have to do a conversion from the Windows encoding to UTF-8.
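
One sketch of doing that conversion inside the Spark load, assuming a windows-1252 source and illustrative HDFS paths: decode the raw bytes with the Windows code page, drop carriage returns (the dos2unix step mentioned above), write the cleaned copy back out as UTF-8, and then read it with the regular CSV reader.

    import java.nio.charset.Charset
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("charset-conversion").getOrCreate()
    val cp1252 = Charset.forName("windows-1252")

    spark.sparkContext
      .binaryFiles("hdfs:///staging/windows_export/*.csv")  // one record per file
      .flatMap { case (_, stream) =>
        new String(stream.toArray, cp1252)   // decode from the Windows code page
          .replace("\r", "")                 // normalise CRLF line endings to LF
          .split("\n")
      }
      .saveAsTextFile("hdfs:///staging/utf8_cleaned")        // written back as UTF-8

    val df = spark.read
      .option("header", "true")
      .csv("hdfs:///staging/utf8_cleaned")

Note that binaryFiles loads each file whole into memory on one executor, so this sketch only suits files of modest size; for large files the charset conversion is better done in a custom input format or as an external pre-processing step.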

Handling windows characters with Spark CSV on Linux

2016-11-17 Thread Mich Talebzadeh
Hi, In the past, with the Databricks package for CSV files, I occasionally had to do some cleaning at the Linux directory level before ingesting a CSV file into the HDFS staging directory for Spark to read. I have a more generic issue that may need to be addressed. Assume that a provider uses FTP to push CSV