Hello guys, hope everyone is well.
I am having an encoding issue when converting a table from Parquet into CSV
files; I wonder if someone could shed some light on it?
One of my data sets has French data with lots of accented characters, and it
is persisted in HDFS as Parquet.
When I query the Parquet table with:

    select `city` from dfs.parquets.`file`

it returns the data properly encoded:

    city
    Montréal
Then I convert this table into a CSV file with the following statements:

    alter session set `store.format` = 'csv';
    create table dfs.csvs.`converted` as select * from dfs.parquets.`file`;
But when I then run a select query on the result, the data comes back
improperly encoded:

    select columns[0] from dfs.csvs.`converted`

returns:

    Montr?al
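(I have not yet ruled out that this is just my SQL client's display. To check
the raw bytes, I suppose one could do something like the following, with
/path/to/csvs standing in for wherever the dfs.csvs workspace points; a 3f
byte where the é should be would mean the '?' is literally in the file:)

    hadoop fs -cat /path/to/csvs/converted/*.csv | hexdump -C | head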
My storage plugin is pretty standard:

    "csv" : {
      "type" : "text",
      "extensions" : [ "csv" ],
      "delimiter" : ",",
      "skipFirstLine" : true
    },
Should I explicitly add a charset option somewhere? I couldn't find anything
helpful in the docs.
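My suspicion is that the CSV writer falls back to the JVM's default charset,
which on my nodes might not be UTF-8. To see what the drillbit's JVM actually
defaults to, I ran something along these lines on the node (the properties are
the standard JVM ones; whether Drill's writer uses them is just my guess):

    # Print the JVM's default encoding-related properties; a value like
    # ANSI_X3.4-1968 (ASCII) would explain accented characters becoming '?'.
    $JAVA_HOME/bin/java -XshowSettings:properties -version 2>&1 | grep -i encoding

    # Also check the locale the drillbit service was started under.
    locale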
I tried adding the following to the drill-env.sh file, but no luck:

    export DRILL_JAVA_OPTS="$DRILL_JAVA_OPTS -Dsaffron.default.charset=UTF-8"
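I am also considering forcing the JVM file encoding and locale directly in
drill-env.sh (restarting the drillbits afterwards). Whether Drill's text
writer honors these is an assumption on my part:

    # Force the JVM default charset and the shell locale to UTF-8.
    export DRILL_JAVA_OPTS="$DRILL_JAVA_OPTS -Dfile.encoding=UTF-8"
    export LANG=en_US.UTF-8
    export LC_ALL=en_US.UTF-8

Would that be the right knob, or is there a cleaner way?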
Has anyone run into similar issues?
Thank you!