Actually, the CSV datasource supports an encoding option[1] (although it does
not support non-ASCII-compatible encodings).
[1]
https://github.com/apache/spark/blob/44c8bfda793b7655e2bd1da5e9915a09ed9d42ce/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L364
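For reference, a minimal sketch (Scala, SparkSession) of passing the encoding
option when reading a CSV file; the path and the ISO-8859-1 charset below are
illustrative assumptions, not values taken from this thread:

import org.apache.spark.sql.SparkSession

object CsvEncodingExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CsvEncodingExample")
      .getOrCreate()

    // "encoding" (alias "charset") tells the CSV datasource how to decode the file.
    val df = spark.read
      .option("header", "true")
      .option("encoding", "ISO-8859-1") // assumed source charset, for illustration
      .csv("hdfs:///staging/input.csv") // hypothetical path

    df.show(5)
    spark.stop()
  }
}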
On 17 Nov 2016 10:59
Thanks Ayan.
That only works for stray extra characters like ^ characters etc. Unfortunately
it does not fix specific character sets.
cheers
Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
There is a utility called dos2unix. You can give it a try.
On 18 Nov 2016 00:20, "Jörn Franke" wrote:
You can do the conversion of character set (is this the issue?) as part of your
loading process in Spark.
As far as I know, the Spark CSV package is based on Hadoop's TextInputFormat.
That format, to the best of my knowledge, supports only UTF-8, so you have to do
a conversion from the Windows encoding to UTF-8.
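A minimal sketch of doing such a conversion during loading (assuming, for
illustration, a windows-1252 source file and hypothetical HDFS paths): the raw
Text bytes are read via Hadoop's TextInputFormat and re-decoded explicitly
before being written back out as UTF-8.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.sql.SparkSession

object CharsetConversionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CharsetConversionExample")
      .getOrCreate()

    // hadoopFile hands back the raw Text bytes, so the charset can be chosen
    // explicitly instead of relying on the default UTF-8 decoding.
    val utf8Lines = spark.sparkContext
      .hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///staging/windows_encoded.csv") // hypothetical input
      .map { case (_, text) => new String(text.getBytes, 0, text.getLength, "windows-1252") }

    // The decoded lines are ordinary JVM (Unicode) strings; saveAsTextFile writes
    // them back out as UTF-8, which the CSV reader can then parse normally.
    utf8Lines.saveAsTextFile("hdfs:///staging/utf8_converted") // hypothetical output
    spark.stop()
  }
}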
Hi,
In the past, with the Databricks CSV package, on occasions I had to do some
cleaning at the Linux directory level before ingesting the CSV file into the
HDFS staging directory for Spark to read it.
I have a more generic issue that may need to be addressed.
Assume that a provider uses FTP to push CSV