You can do the character-set conversion (is this the issue?) as part of your loading process in Spark. As far as I know, the Spark CSV package is based on Hadoop's TextInputFormat, which, to the best of my knowledge, supports only UTF-8. So you have to convert from the Windows encoding to UTF-8 yourself. If you are referring to locale-specific settings (numbers, dates etc.), these are also not supported.
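The conversion can also be done at the Linux staging level before the file ever reaches HDFS. A minimal sketch with iconv (the file names and the Windows-1252 source encoding are assumptions for illustration, not from the thread):

```shell
# Sketch: re-encode a Windows-1252 CSV to UTF-8 before ingesting it into HDFS.
# File names and the source encoding are assumptions.
printf 'caf\351;42\n' > /tmp/in_win.csv        # sample input; \351 is "é" in Windows-1252
iconv -f WINDOWS-1252 -t UTF-8 /tmp/in_win.csv > /tmp/out_utf8.csv
# afterwards, e.g.: hdfs dfs -put /tmp/out_utf8.csv /staging/
```

After this step the file is plain UTF-8 and TextInputFormat-based readers can consume it as-is.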
I have started to work on the HadoopOffice library (which you can use with Spark), where you can read Excel files directly (https://github.com/ZuInnoTe/hadoopoffice). However, there is no official release yet. There you can also specify the locale in which data values, numbers etc. should be interpreted when reading the file.

> On 17 Nov 2016, at 14:11, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Hi,
>
> In the past, with the Databricks package for CSV files, I occasionally had to do
> some cleaning at the Linux directory level before ingesting a CSV file into the HDFS
> staging directory for Spark to read.
>
> I have a more generic issue that may have to be handled.
>
> Assume that a provider uses FTP to push CSV files into Windows directories.
> The whole solution is built around Windows and .NET.
>
> Now you want to ingest those files into HDFS and process them with Spark CSV.
>
> One can create NFS directories visible to the Windows server and to HDFS as well.
> However, there may be issues with character sets etc. What are the best ways
> of handling this? One way would be to use some scripts to make these
> spreadsheet-type files compatible with Linux and then load them into HDFS.
> For example, I know that if I save an Excel spreadsheet file in DOS format,
> that file will work OK with Spark CSV. Are there tools to do this as well?
>
> Thanks
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss,
> damage or destruction of data or any other property which may arise from
> relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
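Regarding the question about tools: iconv (above) covers the character set, and the standard tr utility (or dos2unix, where installed) strips the Windows CRLF line endings that CSV exports from Excel typically carry. A minimal sketch, with made-up file names:

```shell
# Sketch: normalize CRLF (Windows) line endings to LF before loading into HDFS.
# File names are illustrative only.
printf 'col1,col2\r\nx,y\r\n' > /tmp/dos.csv   # sample CRLF-terminated input
tr -d '\r' < /tmp/dos.csv > /tmp/unix.csv      # dos2unix /tmp/dos.csv would do the same
```

Both steps can be chained in the same staging script that pushes the files into HDFS.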