It seems to be an issue only with CSV/TSV files.
Tried writing the output as JSON and it handles the encoding properly.
alter session set `store.format`='json'
create table dfs.tmp.test3 as select `city` from dfs.parquets.`file`
Returns:
{"city": "Montréal"}
Additional info:
parquet-tools schema:
message root {
  optional binary city (UTF8);
}
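
Also, for anyone experimenting at the JVM level: besides the saffron property from my original mail (quoted below), forcing the JVM-wide default charset in drill-env.sh is another thing that could be tried. A sketch, not a confirmed fix; file.encoding is the standard JVM default-charset property, but I have not verified that it changes the CSV writer's behavior:

# drill-env.sh sketch: set UTF-8 both for Calcite/saffron and as the
# JVM-wide default charset used when no explicit encoding is given.
export DRILL_JAVA_OPTS="$DRILL_JAVA_OPTS -Dsaffron.default.charset=UTF-8"
export DRILL_JAVA_OPTS="$DRILL_JAVA_OPTS -Dfile.encoding=UTF-8"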
On Mon, Jul 16, 2018 at 2:49 PM, Carlos Derich <[email protected]>
wrote:
> Hello guys, hope everyone is well.
>
> I am having an encoding issue when converting a table from Parquet into
> CSV files; I wonder if someone could shed some light on it?
>
> One of my data sets contains French text with lots of accented
> characters, and it is persisted in HDFS as Parquet.
>
>
> When I query the Parquet table with *select `city` from
> dfs.parquets.`file`*, it returns the data properly encoded:
>
>
> *city*
>
> *Montréal*
>
>
> Then I convert this table into a CSV file with the following statements:
>
> *alter session set `store.format`='csv'*
> *create table dfs.csvs.`converted` as select * from dfs.parquets.`file`*
>
>
> Then when I run a select query on it, it returns the data improperly encoded:
>
> *select columns[0] from dfs.csvs.`converted`*
>
> Returns:
>
> *Montr?al*
>
>
> My storage plugin is pretty standard:
>
> "csv" : {
> "type" : "text",
> "extensions" : [ "csv" ],
> "delimiter" : ",",
> "skipFirstLine": true
> },
>
> Should I explicitly add a charset option somewhere? I couldn't find
> anything helpful in the docs.
>
> Tried adding *export DRILL_JAVA_OPTS="$DRILL_JAVA_OPTS
> -Dsaffron.default.charset=UTF-8"* to the drill-env.sh file, but no luck.
>
> Has anyone run into similar issues?
>
> Thank you!
>