Hello guys, hope everyone is well.
I am having an encoding issue when converting a table from Parquet into CSV
files; I wonder if someone could shed some light on it?
One of my data sets has French data with lots of accented characters, and it
is persisted in HDFS as Parquet.
When I query the Parquet table with:

    select `city` from dfs.parquets.`file`

it returns the data properly encoded:

    city
    Montréal
Then I convert this table into a CSV file with the following statements:

    alter session set `store.format` = 'csv';
    create table dfs.csvs.`converted` as select * from dfs.parquets.`file`;
But when I then run a select query on the result, the data comes back
improperly encoded:

    select columns[0] from dfs.csvs.`converted`

returns:

    Montr?al
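(I have not yet ruled out that this is just my SQL client's display. To check
the raw bytes, I suppose one could do something like the following, with
/path/to/csvs standing in for wherever the dfs.csvs workspace points; a 3f
byte where the é should be would mean the '?' is literally in the file:)

    hadoop fs -cat /path/to/csvs/converted/*.csv | hexdump -C | head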
My storage plugin is pretty standard:

    "csv" : {
      "type" : "text",
      "extensions" : [ "csv" ],
      "delimiter" : ",",
      "skipFirstLine" : true
    },
Should I explicitly add a charset option somewhere? I couldn't find anything
helpful in the docs.
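My suspicion is that the CSV writer falls back to the JVM's default charset,
which on my nodes might not be UTF-8. To see what the drillbit's JVM actually
defaults to, I ran something along these lines on the node (the properties are
the standard JVM ones; whether Drill's writer uses them is just my guess):

    # Print the JVM's default encoding-related properties; a value like
    # ANSI_X3.4-1968 (ASCII) would explain accented characters becoming '?'.
    $JAVA_HOME/bin/java -XshowSettings:properties -version 2>&1 | grep -i encoding

    # Also check the locale the drillbit service was started under.
    locale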
I tried adding the following to the drill-env.sh file, but no luck:

    export DRILL_JAVA_OPTS="$DRILL_JAVA_OPTS -Dsaffron.default.charset=UTF-8"
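I am also considering forcing the JVM file encoding and locale directly in
drill-env.sh (restarting the drillbits afterwards). Whether Drill's text
writer honors these is an assumption on my part:

    # Force the JVM default charset and the shell locale to UTF-8.
    export DRILL_JAVA_OPTS="$DRILL_JAVA_OPTS -Dfile.encoding=UTF-8"
    export LANG=en_US.UTF-8
    export LC_ALL=en_US.UTF-8

Would that be the right knob, or is there a cleaner way?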
Has anyone run into similar issues?
Thank you!