Actually, the CSV datasource supports an encoding option[1] (although it does not support non-ASCII-compatible encodings).
[1] https://github.com/apache/spark/blob/44c8bfda793b7655e2bd1da5e9915a09ed9d42ce/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L364

On 17 Nov 2016 10:59 p.m., "ayan guha" <guha.a...@gmail.com> wrote:

> There is a utility called dos2unix. You can give it a try.
>
> On 18 Nov 2016 00:20, "Jörn Franke" <jornfra...@gmail.com> wrote:
> >
> > You can do the conversion of character set (is this the issue?) as part of your loading process in Spark.
> > As far as I know, the Spark CSV package is based on Hadoop's TextFileInputFormat. To the best of my knowledge, this format supports only UTF-8, so you have to do a conversion from the Windows encoding to UTF-8. If you refer to language-specific settings (numbers, dates, etc.), these are also not supported.
> >
> > I started to work on the hadoopoffice library (which you can use with Spark), where you can read Excel files directly (https://github.com/ZuInnoTe/hadoopoffice). However, there is no official release yet. There you can also specify the language in which you want to represent data values, numbers, etc. when reading the file.
> >
> > On 17 Nov 2016, at 14:11, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> In the past, with the Databricks package for CSV files, I occasionally had to do some cleaning at the Linux directory level before ingesting a CSV file into the HDFS staging directory for Spark to read.
> >>
> >> I have a more generic issue that may have to be addressed.
> >>
> >> Assume that a provider uses FTP to push CSV files into Windows directories. The whole solution is built around Windows and .NET.
> >>
> >> Now you want to ingest those files into HDFS and process them with Spark CSV.
> >>
> >> One can create NFS directories visible to the Windows server and to HDFS as well. However, there may be issues with character sets etc. What are the best ways of handling this?
> >> One way would be to use some scripts to make these spreadsheet files compatible with Linux and then load them into HDFS. For example, I know that if I save an Excel spreadsheet file in DOS format, that file will work OK with Spark CSV. Are there tools to do this as well?
> >>
> >> Thanks
> >>
> >> Dr Mich Talebzadeh
> >>
> >> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>
> >> http://talebzadehmich.wordpress.com
> >>
> >> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
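
The two pre-processing steps suggested in the thread (Jörn's character-set conversion and ayan's dos2unix) can be sketched in plain Python, outside Spark. This is only a minimal sketch: the function names and the assumption that the Windows files are cp1252-encoded are mine, not from the thread.

```python
import os
import tempfile

def convert_to_utf8(src_path, dst_path, src_encoding="cp1252"):
    """Re-encode a text file (e.g. one produced on Windows) to UTF-8,
    leaving line endings untouched (newline='' disables translation)."""
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for chunk in src:
            dst.write(chunk)

def dos2unix(path):
    """Rewrite a file in place, replacing DOS (CRLF) line endings with LF."""
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(data.replace(b"\r\n", b"\n"))

# Demo on a small CSV with a cp1252-only byte (0xE9 = 'é') and CRLF endings.
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "windows.csv")
    dst = os.path.join(d, "utf8.csv")
    with open(src, "wb") as f:
        f.write(b"name,city\r\nJos\xe9,Madrid\r\n")
    convert_to_utf8(src, dst)
    dos2unix(dst)
    with open(dst, "rb") as f:
        print(f.read())  # b'name,city\nJos\xc3\xa9,Madrid\n'
```

On Spark builds that include the encoding option from [1], the charset step may be unnecessary, e.g. `spark.read.option("encoding", "cp1252").csv(path)` — again, cp1252 is an assumption about the source files.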