[ 
https://issues.apache.org/jira/browse/IMPALA-5675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846532#comment-16846532
 ] 

David Royo commented on IMPALA-5675:
------------------------------------

We are currently facing this issue in a project. German umlaut characters 
cause VARCHAR values to be truncated.
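
To illustrate the effect, here is a minimal Python sketch. It assumes that the 
VARCHAR(n) limit is applied to UTF-8 bytes rather than to characters, as the 
issue description suggests. The helper truncate_utf8_bytes is hypothetical and 
only models the observed behaviour; it is not Impala code, and how Impala 
treats a cut that lands inside a multi-byte character is not covered here.

```python
def truncate_utf8_bytes(value: str, limit: int) -> str:
    """Cut `value` to at most `limit` UTF-8 bytes (hypothetical model of the bug)."""
    raw = value.encode("utf-8")[:limit]
    # If the cut landed inside a multi-byte sequence, drop the partial character.
    return raw.decode("utf-8", errors="ignore")


name = "Müller-Lüdenscheidt"           # 19 characters, two of them umlauts
print(len(name))                        # length in characters
print(len(name.encode("utf-8")))        # length in bytes: each umlaut takes 2
print(truncate_utf8_bytes(name, 19))    # a 19-byte limit loses the last 2 characters
```

With two umlauts in the value, a limit counted in bytes drops two characters 
from the end, which matches the pattern in the issue description (three 
special characters, three characters lost).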

All our tables are stored as Parquet with UTF-8 strings, and we keep metadata 
containing the _real_ VARCHAR length (as it was in the source system).

We used STRING types everywhere for a while, but we introduced VARCHARs 
because the customer connects from SAS to Impala via JDBC.

SAS performs very poorly with unbounded strings (which it handles as 
VARCHARs of maximum length) and advises all Impala users to use VARCHAR instead.

Since this is resolved on neither the Impala side nor the SAS side, we are 
thinking about doubling the length of the VARCHARs.

In theory (fingers crossed), that should mean slightly worse performance in 
SAS, but at least usable VARCHARs.
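
One caveat on doubling, shown as a rough Python sketch: doubling covers 
characters that take at most 2 bytes in UTF-8 (umlauts, ß, most accented Latin 
letters), but some characters, such as the euro sign, take 3 bytes, and others 
take 4. Assuming the limit is counted in bytes, the hypothetical helpers below 
scan sample values and check whether the widest one still fits into twice the 
declared length.

```python
from typing import Iterable


def required_byte_length(values: Iterable[str]) -> int:
    """Return the largest UTF-8 byte length among the values (0 if empty)."""
    return max((len(v.encode("utf-8")) for v in values), default=0)


def fits_when_doubled(values: Iterable[str], declared_length: int) -> bool:
    """True if every value fits into 2 * declared_length bytes."""
    return required_byte_length(values) <= 2 * declared_length


# Both samples are 5 characters long, i.e. valid for a source VARCHAR(5).
print(required_byte_length(["Größe"]))     # umlaut and ß take 2 bytes each
print(fits_when_doubled(["Größe"], 5))     # fits into 10 bytes
print(fits_when_doubled(["€€€€€"], 5))     # 3 bytes per euro sign, does not fit
```

Running a check like this over the real column values before choosing the new 
lengths would show whether doubling is enough for the data at hand.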

Any word of advice?

> Wrong results when querying tables with CHAR/VARCHAR datatypes
> --------------------------------------------------------------
>
>                 Key: IMPALA-5675
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5675
>             Project: IMPALA
>          Issue Type: Bug
>    Affects Versions: Impala 2.7.0
>         Environment: Cloudera distro 5.10.1
>            Reporter: Branislav Lukáč
>            Priority: Major
>         Attachments: Hive_query.png, Impala_query.png
>
>
> We created an external table with the following query:
> CREATE EXTERNAL TABLE IF NOT EXISTS SAPNSQ.ZAP_GL_EX_IM_CSV ( GLREQUEST 
> DECIMAL(30), KNUMC STRING, FACCP STRING, FCHAR VARCHAR(20), FCLNT VARCHAR(3), 
> FCUKY STRING, FCURR DOUBLE, FDATS STRING, FDEC DECIMAL(8, 2), FFLTP FLOAT, 
> FINT1 TINYINT, FINT2 SMALLINT, FINT4 BIGINT, FLANG STRING, FPREC DOUBLE, 
> FQUAN DOUBLE, FTIMS STRING, FUNIT STRING, FSSTRING STRING, FCHAR40 
> VARCHAR(40) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" STORED AS 
> TEXTFILE LOCATION 
> "hdfs:///user/nsqhdp/H_CDC_IMPQ/ZAP_GL_EX_IM_CSV/63E55F5943E95122E1000000C0A83051"
>  
> CSV files are already present at the specified location: 
> hdfs:///user/nsqhdp/H_CDC_IMPQ/ZAP_GL_EX_IM_CSV/63E55F5943E95122E1000000C0A83051
>  
> When we execute SELECT fchar40 FROM sapnsq.zap_gl_ex_im_csv ORDER BY fchar40 
> with both Hive and Impala, we get different results:
> - Hive (see Hive_query.png)
> - Impala (see Impala_query.png)
> It seems that the Impala engine truncates strings when they contain 
> non-ASCII characters.
> If a character is encoded with 2 bytes, Impala counts it as 2 characters 
> (instead of 1).
> As a result, FCHAR40 VARCHAR(40) actually returns fewer than 40 characters.
>  
> Example:
> 1st row contains 3 special characters: É, Ï and ü
> Select with Impala truncates the result by 3 characters.
> According to Impala documentation 
> (https://www.cloudera.com/documentation/enterprise/5-7-x/topics/impala_varchar.html),
>  Unicode should be supported:
> "All data in CHAR and VARCHAR columns must be in a character encoding that is 
> compatible with UTF-8"



