[jira] [Updated] (HIVE-14846) Char encoding does not apply to newline chars
     [ https://issues.apache.org/jira/browse/HIVE-14846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoltan Ivanfi updated HIVE-14846:
---------------------------------
    Description: 
I created and populated a table with utf-16 encoding:

{noformat}
hive> create external table utf16 (col1 timestamp, col2 string) row format delimited fields terminated by "," location '/tmp/utf16';
hive> alter table utf16 set serdeproperties ('serialization.encoding'='UTF-16');
hive> insert into utf16 values('2010-01-01 00:00:00.000', 'hőség');
{noformat}

Then I checked the contents of the file:

{noformat}
$ hadoop fs -cat /tmp/utf16/00_0 | hd
0000  fe ff 00 32 00 30 00 31 00 30 00 2d 00 30 00 31  |...2.0.1.0.-.0.1|
0010  00 2d 00 30 00 31 00 20 00 30 00 30 00 3a 00 30  |.-.0.1. .0.0.:.0|
0020  00 30 00 3a 00 30 00 30 00 2c 00 68 01 51 00 73  |.0.:.0.0.,.h.Q.s|
0030  00 e9 00 67 0a                                   |...g.|
0035
{noformat}

The newline character is represented as 0a instead of the expected 00 0a.

If I do it the other way around and put correct UTF-16 files into HDFS and try to query them from Hive, I get unknown unicode chars in the output:

{noformat}
hive> select * from utf16;
2010-01-01 00:00:00	hőség�
2010-01-02 00:00:00	város�
2010-01-03 00:00:00	füzet�
{noformat}


  was:
I created and populated a table with utf-16 encoding:

{noformat}
hive> create external table utf16 (col1 timestamp, col2 string) row format delimited fields terminated by "," location '/tmp/utf16';
hive> alter table utf16 set serdeproperties ('serialization.encoding'='UTF-16');
hive> insert into utf16 values('2010-01-01 00:00:00.000', 'hőség');
{noformat}

Then I checked the contents of the file:

{noformat}
$ hadoop fs -cat /tmp/utf16/00_0 | hd
0000  fe ff 00 32 00 30 00 31 00 30 00 2d 00 30 00 31  |...2.0.1.0.-.0.1|
0010  00 2d 00 30 00 34 00 20 00 30 00 30 00 3a 00 30  |.-.0.4. .0.0.:.0|
0020  00 30 00 3a 00 30 00 30 00 2c 00 63 00 69 00 70  |.0.:.0.0.,.c.i.p|
0030  01 51 0a                                         |.Q.|
0033
{noformat}

The newline character is represented as 0a instead of the expected 00 0a.
If I do it the other way around and put correct UTF-16 files into HDFS and try to query them from Hive, I get unknown unicode chars in the output:

{noformat}
hive> select * from utf16;
2010-01-01 00:00:00	hőség�
2010-01-02 00:00:00	város�
2010-01-03 00:00:00	füzet�
{noformat}


> Char encoding does not apply to newline chars
> ---------------------------------------------
>
>                 Key: HIVE-14846
>                 URL: https://issues.apache.org/jira/browse/HIVE-14846
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.1.0
>            Reporter: Zoltan Ivanfi
>            Priority: Minor
>
> I created and populated a table with utf-16 encoding:
> {noformat}
> hive> create external table utf16 (col1 timestamp, col2 string) row format
> delimited fields terminated by "," location '/tmp/utf16';
> hive> alter table utf16 set serdeproperties
> ('serialization.encoding'='UTF-16');
> hive> insert into utf16 values('2010-01-01 00:00:00.000', 'hőség');
> {noformat}
> Then I checked the contents of the file:
> {noformat}
> $ hadoop fs -cat /tmp/utf16/00_0 | hd
> 0000  fe ff 00 32 00 30 00 31 00 30 00 2d 00 30 00 31  |...2.0.1.0.-.0.1|
> 0010  00 2d 00 30 00 31 00 20 00 30 00 30 00 3a 00 30  |.-.0.1. .0.0.:.0|
> 0020  00 30 00 3a 00 30 00 30 00 2c 00 68 01 51 00 73  |.0.:.0.0.,.h.Q.s|
> 0030  00 e9 00 67 0a                                   |...g.|
> 0035
> {noformat}
> The newline character is represented as 0a instead of the expected 00 0a.
> If I do it the other way around and put correct UTF-16 files into HDFS and
> try to query them from Hive, I get unknown unicode chars in the output:
> {noformat}
> hive> select * from utf16;
> 2010-01-01 00:00:00	hőség�
> 2010-01-02 00:00:00	város�
> 2010-01-03 00:00:00	füzet�
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
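For reference, both symptoms in the report can be reproduced outside Hive. The sketch below is an illustration only, not Hive code; it assumes (as the hexdumps suggest) that the field bytes are encoded with serialization.encoding while the row terminator is handled as a single raw byte:

```python
# Minimal sketch of the reported behavior (assumption: the serde encodes
# field bytes in UTF-16 but appends/splits the row terminator as one raw
# 0x0a byte; this is not Hive source code, only an illustration).

text = "hőség"

# Write side: field encoded as UTF-16BE, newline appended as raw 0x0a.
# A fully UTF-16 row would instead end in the two bytes 00 0a.
buggy_row = text.encode("utf-16-be") + b"\x0a"
print(buggy_row.hex(" "))  # 00 68 01 51 00 73 00 e9 00 67 0a

# Read side: given a *correct* UTF-16BE file, splitting records on the
# raw 0x0a byte leaves the leading 0x00 of the "00 0a" pair attached to
# the line.  The odd-length line then decodes with a trailing
# replacement character -- matching the "hőség�" output in the report.
correct_file = (text + "\n").encode("utf-16-be")
line = correct_file.split(b"\x0a")[0]              # ends in a stray 0x00
print(line.decode("utf-16-be", errors="replace"))  # hőség�
```

This matches the hexdump above: the file ends in `00 67 0a` rather than the expected `00 67 00 0a`, so the terminator is the only byte not covered by the configured encoding.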