[jira] [Updated] (HIVE-14846) Char encoding does not apply to newline chars

2016-09-27 Thread Zoltan Ivanfi (JIRA)

 [ https://issues.apache.org/jira/browse/HIVE-14846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoltan Ivanfi updated HIVE-14846:
---------------------------------
Description: 
I created and populated a table with UTF-16 encoding:

{noformat}
hive> create external table utf16 (col1 timestamp, col2 string) row format 
delimited fields terminated by "," location '/tmp/utf16';
hive> alter table utf16 set serdeproperties ('serialization.encoding'='UTF-16');
hive> insert into utf16 values('2010-01-01 00:00:00.000', 'hőség');
{noformat}
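
This is not Hive's actual writer code, just a minimal Java sketch (the class name is made up) of what the file contents below suggest is happening: the row content goes through the configured charset, while the row terminator is appended as a single raw byte that bypasses it.

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;

public class NewlineEncodingSketch {
    public static void main(String[] args) throws IOException {
        // Java's "UTF-16" charset writes a BOM (fe ff) followed by big-endian code units
        Charset utf16 = Charset.forName("UTF-16");
        ByteArrayOutputStream out = new ByteArrayOutputStream();

        // The row content goes through the configured encoding...
        out.write("2010-01-01 00:00:00,hőség".getBytes(utf16));
        // ...but the row terminator is apparently written as one raw byte,
        // which reproduces the lone 0a at the end of the dump below.
        out.write('\n');

        for (byte b : out.toByteArray()) {
            System.out.printf("%02x ", b & 0xff);
        }
        System.out.println();
    }
}
{code}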

Then I checked the contents of the file:

{noformat}
$ hadoop fs -cat /tmp/utf16/00_0 | hd
0000  fe ff 00 32 00 30 00 31  00 30 00 2d 00 30 00 31  |...2.0.1.0.-.0.1|
0010  00 2d 00 30 00 31 00 20  00 30 00 30 00 3a 00 30  |.-.0.1. .0.0.:.0|
0020  00 30 00 3a 00 30 00 30  00 2c 00 68 01 51 00 73  |.0.:.0.0.,.h.Q.s|
0030  00 e9 00 67 0a                                    |...g.|
0035
{noformat}

The newline character is represented as the single byte 0a instead of the expected UTF-16 sequence 00 0a.
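
A quick sanity check of that expectation (plain Java, nothing Hive-specific): encoding a newline as UTF-16BE really does produce the two-byte sequence 00 0a.

{code:java}
import java.nio.charset.StandardCharsets;

public class ExpectedNewline {
    public static void main(String[] args) {
        // A newline encoded as UTF-16BE is the two-byte sequence 00 0a,
        // not the single byte 0a found in the file above.
        for (byte b : "\n".getBytes(StandardCharsets.UTF_16BE)) {
            System.out.printf("%02x ", b & 0xff);
        }
        System.out.println(); // prints: 00 0a
    }
}
{code}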

If I do it the other way around and put correctly encoded UTF-16 files into HDFS and try to query them from Hive, I get unknown Unicode characters in the output:

{noformat}
hive> select * from utf16;
2010-01-01 00:00:00 hőség�
2010-01-02 00:00:00 város�
2010-01-03 00:00:00 füzet�
{noformat}
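
A plausible explanation for the trailing replacement characters (an assumption on my part, not verified against Hive's record reader): if records are split on the raw byte 0a rather than on the encoded 00 0a sequence, each UTF-16BE line keeps the dangling 00 high byte of its terminator, and decoding that odd-length chunk yields a � at the end. A minimal Java sketch of that failure mode:

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16SplitSketch {
    public static void main(String[] args) {
        // A correctly encoded UTF-16BE line:
        // "hőség\n" = 00 68 01 51 00 73 00 e9 00 67 00 0a
        byte[] file = "hőség\n".getBytes(StandardCharsets.UTF_16BE);

        // Splitting on the raw byte 0a instead of the full 00 0a sequence
        // leaves the terminator's 00 high byte attached to the line.
        int nl = 0;
        while (file[nl] != 0x0a) {
            nl++;
        }
        byte[] line = Arrays.copyOfRange(file, 0, nl); // 11 bytes, odd length

        // Decoding the odd-length chunk yields a trailing replacement char.
        System.out.println(new String(line, StandardCharsets.UTF_16BE)); // hőség�
    }
}
{code}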


  was:
I created and populated a table with UTF-16 encoding:

{noformat}
hive> create external table utf16 (col1 timestamp, col2 string) row format 
delimited fields terminated by "," location '/tmp/utf16';
hive> alter table utf16 set serdeproperties ('serialization.encoding'='UTF-16');
hive> insert into utf16 values('2010-01-01 00:00:00.000', 'hőség');
{noformat}

Then I checked the contents of the file:

{noformat}
$ hadoop fs -cat /tmp/utf16/00_0 | hd
0000  fe ff 00 32 00 30 00 31  00 30 00 2d 00 30 00 31  |...2.0.1.0.-.0.1|
0010  00 2d 00 30 00 34 00 20  00 30 00 30 00 3a 00 30  |.-.0.4. .0.0.:.0|
0020  00 30 00 3a 00 30 00 30  00 2c 00 63 00 69 00 70  |.0.:.0.0.,.c.i.p|
0030  01 51 0a                                          |.Q.|
0033
{noformat}

The newline character is represented as the single byte 0a instead of the expected UTF-16 sequence 00 0a.

If I do it the other way around and put correctly encoded UTF-16 files into HDFS and try to query them from Hive, I get unknown Unicode characters in the output:

{noformat}
hive> select * from utf16;
2010-01-01 00:00:00 hőség�
2010-01-02 00:00:00 város�
2010-01-03 00:00:00 füzet�
{noformat}



> Char encoding does not apply to newline chars
> ---------------------------------------------
>
> Key: HIVE-14846
> URL: https://issues.apache.org/jira/browse/HIVE-14846
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Zoltan Ivanfi
>Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

