Actually, the CSV datasource supports an encoding option[1] (although it does not support non-ASCII-compatible encodings).
[1] https://github.com/apache/spark/blob/44c8bfda793b7655e2bd1da5e9915a09ed9d42ce/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L364

On 17 Nov 2016 10:59 p.m., "ayan guha" <guha.a...@gmail.com> wrote:

> There is a utility called dos2unix. You can give it a try.
>
> On 18 Nov 2016 00:20, "Jörn Franke" <jornfra...@gmail.com> wrote:
> >
> > You can do the conversion of character set (is this the issue?) as part of your loading process in Spark.
> > As far as I know, the Spark CSV package is based on Hadoop's TextFileInputFormat. To the best of my knowledge, this format supports only UTF-8, so you have to do a conversion from the Windows encoding to UTF-8. If you refer to language-specific settings (numbers, dates, etc.), these are also not supported.
> >
> > I started to work on the hadoopoffice library (which you can use with Spark), where you can read Excel files directly (https://github.com/ZuInnoTe/hadoopoffice). However, there is no official release yet. There you can also specify the language in which you want to represent data values, numbers, etc. when reading the file.
> >
> > On 17 Nov 2016, at 14:11, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> In the past, with the Databricks package for CSV files, I occasionally had to do some cleaning at the Linux directory level before ingesting a CSV file into the HDFS staging directory for Spark to read.
> >>
> >> I have a more generic issue that may have to be addressed.
> >>
> >> Assume that a provider uses FTP to push CSV files into Windows directories. The whole solution is built around Windows and .NET.
> >>
> >> Now you want to ingest those files into HDFS and process them with Spark CSV.
> >>
> >> One can create NFS directories visible to the Windows server and to HDFS as well. However, there may be issues with character sets etc. What are the best ways of handling this?
> >> One way would be to use some scripts to make these spreadsheet files compatible with Linux and then load them into HDFS. For example, I know that if I save an Excel spreadsheet file in DOS format, that file will work OK with Spark CSV. Are there tools to do this as well?
> >>
> >> Thanks
> >>
> >> Dr Mich Talebzadeh
> >>
> >> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>
> >> http://talebzadehmich.wordpress.com
> >>
> >> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
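
The two pre-processing steps suggested in the thread (Jörn's character-set conversion and ayan's dos2unix) can be sketched in plain Python, outside Spark. This is only a minimal sketch: the function names and the assumption that the Windows files are cp1252-encoded are mine, not from the thread.

```python
import os
import tempfile

def convert_to_utf8(src_path, dst_path, src_encoding="cp1252"):
    """Re-encode a text file (e.g. one produced on Windows) to UTF-8,
    leaving line endings untouched (newline='' disables translation)."""
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for chunk in src:
            dst.write(chunk)

def dos2unix(path):
    """Rewrite a file in place, replacing DOS (CRLF) line endings with LF."""
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(data.replace(b"\r\n", b"\n"))

# Demo on a small CSV with a cp1252-only byte (0xE9 = 'é') and CRLF endings.
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "windows.csv")
    dst = os.path.join(d, "utf8.csv")
    with open(src, "wb") as f:
        f.write(b"name,city\r\nJos\xe9,Madrid\r\n")
    convert_to_utf8(src, dst)
    dos2unix(dst)
    with open(dst, "rb") as f:
        print(f.read())  # b'name,city\nJos\xc3\xa9,Madrid\n'
```

On Spark builds that include the encoding option from [1], the charset step may be unnecessary, e.g. `spark.read.option("encoding", "cp1252").csv(path)` — again, cp1252 is an assumption about the source files.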