[jira] [Commented] (HIVE-11095) SerDeUtils another bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608241#comment-14608241 ]

xiaowei wang commented on HIVE-11095:
-------------------------------------

I am so glad to contribute code to the community.

SerDeUtils another bug ,when Text is reused
-------------------------------------------

Key: HIVE-11095
URL: https://issues.apache.org/jira/browse/HIVE-11095
Project: Hive
Issue Type: Bug
Components: API, CLI
Affects Versions: 0.14.0, 1.0.0, 1.2.0
Environment: Hadoop 2.3.0-cdh5.0.0, Hive 0.14
Reporter: xiaowei wang
Assignee: xiaowei wang
Fix For: 1.3.0, 2.0.0
Attachments: HIVE-11095.1.patch.txt, HIVE-11095.2.patch.txt, HIVE-11095.3.patch.txt

{noformat}
The method transformTextFromUTF8 has a bug: it calls Text.getBytes().
Text.getBytes() returns the raw backing array; however, only the data up to Text.getLength() is valid.
A better way is to use copyBytes() if you need the returned array to be precisely the length of the data,
but copyBytes() was only added after Hadoop 1.
{noformat}

How I found this bug:
When I queried data from an LZO table, I found that in the results the length of the current row is always larger than that of the previous row, and sometimes the current row contains content from the previous row.
For example, I executed this query:
{code:sql}
select * from web_searchhub where logdate=2015061003
{code}
The result is shown below. Notice that the second row contains the tail of the first row.
{noformat}
INFO [03:00:05.589] HttpFrontServer::FrontSH msgRecv:Remote=/10.13.193.68:42098,session=3151,thread=254 2015061003
INFO [03:00:05.594] 18941e66-9962-44ad-81bc-3519f47ba274 session=901,thread=223ession=3151,thread=254 2015061003
{noformat}
The content of the original LZO file is shown below; it has just 2 rows.
{noformat}
INFO [03:00:05.635] b88e0473-7530-494c-82d8-e2d2ebd2666c_forweb session=3148,thread=285
INFO [03:00:05.635] HttpFrontServer::FrontSH msgRecv:Remote=/10.13.193.68:42095,session=3148,thread=285
{noformat}
I think this error is caused by the reuse of the Text object, and I have found a solution.

Additionally, the table creation SQL is:
{code:sql}
CREATE EXTERNAL TABLE `web_searchhub`(
  `line` string)
PARTITIONED BY (
  `logdate` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ' U'
WITH SERDEPROPERTIES (
  'serialization.encoding'='GBK')
STORED AS INPUTFORMAT
  'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'viewfs://nsX/user/hive/warehouse/raw.db/web/web_searchhub';
{code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
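The stale-bytes effect described in the issue can be reproduced without Hadoop. The following is only a minimal sketch: `ReusableText` is a hypothetical stand-in for a reusable buffer such as `org.apache.hadoop.io.Text`, written for illustration and not the actual Hadoop class or the committed patch. It assumes the backing array grows but is never cleared when a shorter value is set, which is the behaviour the report attributes to Text reuse.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TextReuseDemo {

    /** Hypothetical stand-in for a reusable buffer like org.apache.hadoop.io.Text. */
    public static final class ReusableText {
        private byte[] bytes = new byte[0];
        private int length = 0;

        /** Overwrites the content; the backing array only grows and is never cleared. */
        public void set(String s) {
            byte[] src = s.getBytes(StandardCharsets.UTF_8);
            if (src.length > bytes.length) {
                bytes = new byte[src.length];
            }
            System.arraycopy(src, 0, bytes, 0, src.length);
            length = src.length;
        }

        /** Raw backing array: bytes past getLength() may belong to an earlier value. */
        public byte[] getBytes() { return bytes; }

        /** Number of valid bytes in the backing array. */
        public int getLength() { return length; }

        /** Copy trimmed to exactly the valid length. */
        public byte[] copyBytes() { return Arrays.copyOf(bytes, length); }
    }

    /** Buggy pattern: decodes the whole backing array, including stale bytes. */
    public static String decodeWholeBuffer(ReusableText t) {
        return new String(t.getBytes(), StandardCharsets.UTF_8);
    }

    /** Safe pattern: decodes only the first getLength() bytes. */
    public static String decodeValidBytes(ReusableText t) {
        return new String(t.getBytes(), 0, t.getLength(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ReusableText row = new ReusableText();
        row.set("session=3151,thread=254"); // first, longer row
        row.set("session=901");             // second, shorter row reuses the buffer

        // The whole-buffer decode still carries the tail of the first row.
        System.out.println(decodeWholeBuffer(row));
        // The length-bounded decode returns only the second row.
        System.out.println(decodeValidBytes(row));
    }
}
```

Bounding the decode by `getLength()` avoids the extra array copy that `copyBytes()` makes and does not depend on `copyBytes()` being available, which matters given the note in the description that it was added after Hadoop 1.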
[jira] [Commented] (HIVE-11095) SerDeUtils another bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14607606#comment-14607606 ]

Chengxiang Li commented on HIVE-11095:
--------------------------------------

Hi [~xiaowei], the patch looks good. After a patch gets a +1, we need to wait 24 hours before committing it, to make sure others have the opportunity to review it as well. That is just how the community works.
[jira] [Commented] (HIVE-11095) SerDeUtils another bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14607620#comment-14607620 ]

xiaowei wang commented on HIVE-11095:
-------------------------------------

OK, I understand! Thanks very much!
[jira] [Commented] (HIVE-11095) SerDeUtils another bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606912#comment-14606912 ]

Hive QA commented on HIVE-11095:
--------------------------------

{color:green}Overall{color}: +1 all checks pass

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12742660/HIVE-11095.3.patch.txt

{color:green}SUCCESS:{color} +1 9035 tests passed

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4436/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4436/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-4436/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12742660 - PreCommit-HIVE-TRUNK-Build
[jira] [Commented] (HIVE-11095) SerDeUtils another bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606921#comment-14606921 ]

xiaowei wang commented on HIVE-11095:
-------------------------------------

[~xuefuz] I added a test case, so I need a code review. The tests have passed.
[jira] [Commented] (HIVE-11095) SerDeUtils another bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14607202#comment-14607202 ]

xiaowei wang commented on HIVE-11095:
-------------------------------------

Thanks!
[jira] [Commented] (HIVE-11095) SerDeUtils another bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14607540#comment-14607540 ]

xiaowei wang commented on HIVE-11095:
-------------------------------------

Is there a problem?
[jira] [Commented] (HIVE-11095) SerDeUtils another bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14607197#comment-14607197 ]

Xuefu Zhang commented on HIVE-11095:
------------------------------------

+1
[jira] [Commented] (HIVE-11095) SerDeUtils another bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605013#comment-14605013 ]

Sushanth Sowmyan commented on HIVE-11095:
-----------------------------------------

Removing fix version of 1.2.0 since this is not part of the already-released 1.2.0 release. Please set appropriate commit version when this fix is committed.
[jira] [Commented] (HIVE-11095) SerDeUtils another bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605039#comment-14605039 ]

xiaowei wang commented on HIVE-11095:
-------------------------------------

Thank you for the suggestion, [~sushant.patil]! This bug affects 0.14, 1.0, 1.1, and 1.2.
[jira] [Commented] (HIVE-11095) SerDeUtils another bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605045#comment-14605045 ]

xiaowei wang commented on HIVE-11095:
-------------------------------------

[~brocknoland]
[jira] [Commented] (HIVE-11095) SerDeUtils another bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14604384#comment-14604384 ]

Ashutosh Chauhan commented on HIVE-11095:

This one seems to be the same issue as HIVE-2. If so, we should close this as a dupe, since the one on HIVE-2 has a patch which contains a test case.

SerDeUtils another bug ,when Text is reused
Key: HIVE-11095
URL: https://issues.apache.org/jira/browse/HIVE-11095
Project: Hive
Issue Type: Bug
Components: API, CLI
Affects Versions: 0.14.0, 1.0.0, 1.2.0
Environment: Hadoop 2.3.0-cdh5.0.0 Hive 0.14
Reporter: xiaowei wang
Assignee: xiaowei wang
Fix For: 1.2.0
Attachments: HIVE-11095.1.patch.txt, HIVE-11095.2.patch.txt
[jira] [Commented] (HIVE-11095) SerDeUtils another bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14604411#comment-14604411 ]

xiaowei wang commented on HIVE-11095:

This one is not the same as HIVE-2. In 2, the patch is for the method transformTextToUTF8; my patch is for the method transformTextFromUTF8.

SerDeUtils another bug ,when Text is reused
Key: HIVE-11095
URL: https://issues.apache.org/jira/browse/HIVE-11095
Project: Hive
Issue Type: Bug
Components: API, CLI
Affects Versions: 0.14.0, 1.0.0, 1.2.0
Environment: Hadoop 2.3.0-cdh5.0.0 Hive 0.14
Reporter: xiaowei wang
Assignee: xiaowei wang
Fix For: 1.2.0
Attachments: HIVE-11095.1.patch.txt, HIVE-11095.2.patch.txt
[jira] [Commented] (HIVE-11095) SerDeUtils another bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14604433#comment-14604433 ]

xiaowei wang commented on HIVE-11095:

[~ashutoshc]

SerDeUtils another bug ,when Text is reused
Key: HIVE-11095
URL: https://issues.apache.org/jira/browse/HIVE-11095
Project: Hive
Issue Type: Bug
Components: API, CLI
Affects Versions: 0.14.0, 1.0.0, 1.2.0
Environment: Hadoop 2.3.0-cdh5.0.0 Hive 0.14
Reporter: xiaowei wang
Assignee: xiaowei wang
Fix For: 1.2.0
Attachments: HIVE-11095.1.patch.txt, HIVE-11095.2.patch.txt
[jira] [Commented] (HIVE-11095) SerDeUtils another bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14604434#comment-14604434 ]

xiaowei wang commented on HIVE-11095:

This one is not the same as HIVE-2. In 2, the patch is for the method transformTextToUTF8; my patch is for the method transformTextFromUTF8.

SerDeUtils another bug ,when Text is reused
Key: HIVE-11095
URL: https://issues.apache.org/jira/browse/HIVE-11095
Project: Hive
Issue Type: Bug
Components: API, CLI
Affects Versions: 0.14.0, 1.0.0, 1.2.0
Environment: Hadoop 2.3.0-cdh5.0.0 Hive 0.14
Reporter: xiaowei wang
Assignee: xiaowei wang
Fix For: 1.2.0
Attachments: HIVE-11095.1.patch.txt, HIVE-11095.2.patch.txt
[jira] [Commented] (HIVE-11095) SerDeUtils another bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603272#comment-14603272 ]

Hive QA commented on HIVE-11095:

{color:green}Overall{color}: +1 all checks pass

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12742079/HIVE-11095.2.patch.txt

{color:green}SUCCESS:{color} +1 9025 tests passed

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4395/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4395/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-4395/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12742079 - PreCommit-HIVE-TRUNK-Build

SerDeUtils another bug ,when Text is reused
Key: HIVE-11095
URL: https://issues.apache.org/jira/browse/HIVE-11095
Project: Hive
Issue Type: Bug
Components: API, CLI
Affects Versions: 0.14.0, 1.0.0, 1.2.0
Environment: Hadoop 2.3.0-cdh5.0.0 Hive 0.14
Reporter: xiaowei wang
Assignee: xiaowei wang
Fix For: 1.2.0
Attachments: HIVE-11095.1.patch.txt, HIVE-11095.2.patch.txt
[jira] [Commented] (HIVE-11095) SerDeUtils another bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602617#comment-14602617 ]

xiaowei wang commented on HIVE-11095:

Following Chengxiang Li's suggestion, I have uploaded a new patch, HIVE-11095.2.patch.txt.

SerDeUtils another bug ,when Text is reused
Key: HIVE-11095
URL: https://issues.apache.org/jira/browse/HIVE-11095
Project: Hive
Issue Type: Bug
Components: API, CLI
Affects Versions: 0.14.0, 1.0.0, 1.2.0
Environment: Hadoop 2.3.0-cdh5.0.0 Hive 0.14
Reporter: xiaowei wang
Assignee: xiaowei wang
Fix For: 1.2.0
Attachments: HIVE-11095.1.patch.txt, HIVE-11095.2.patch.txt
[jira] [Commented] (HIVE-11095) SerDeUtils another bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602714#comment-14602714 ]

Chengxiang Li commented on HIVE-11095:

[~xiaowei], this should be the same issue as HIVE-10983. Normally we prefer to handle such cases in a single JIRA; would you like to merge this patch into HIVE-10983?

SerDeUtils another bug ,when Text is reused
Key: HIVE-11095
URL: https://issues.apache.org/jira/browse/HIVE-11095
Project: Hive
Issue Type: Bug
Components: API, CLI
Affects Versions: 0.14.0, 1.0.0, 1.2.0
Environment: Hadoop 2.3.0-cdh5.0.0 Hive 0.14
Reporter: xiaowei wang
Assignee: xiaowei wang
Fix For: 1.2.0
Attachments: HIVE-11095.1.patch.txt, HIVE-11095.2.patch.txt
[jira] [Commented] (HIVE-11095) SerDeUtils another bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602742#comment-14602742 ]

xiaowei wang commented on HIVE-11095:

OK, I will merge this patch into HIVE-10983. Thanks for your suggestions!

SerDeUtils another bug ,when Text is reused
Key: HIVE-11095
URL: https://issues.apache.org/jira/browse/HIVE-11095
Project: Hive
Issue Type: Bug
Components: API, CLI
Affects Versions: 0.14.0, 1.0.0, 1.2.0
Environment: Hadoop 2.3.0-cdh5.0.0 Hive 0.14
Reporter: xiaowei wang
Assignee: xiaowei wang
Fix For: 1.2.0
Attachments: HIVE-11095.1.patch.txt, HIVE-11095.2.patch.txt
[jira] [Commented] (HIVE-11095) SerDeUtils another bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602099#comment-14602099 ]

Hive QA commented on HIVE-11095:

{color:green}Overall{color}: +1 all checks pass

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12741611/HIVE-11095.1.patch.txt

{color:green}SUCCESS:{color} +1 9025 tests passed

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4385/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4385/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-4385/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12741611 - PreCommit-HIVE-TRUNK-Build

SerDeUtils another bug ,when Text is reused
Key: HIVE-11095
URL: https://issues.apache.org/jira/browse/HIVE-11095
Project: Hive
Issue Type: Bug
Components: API, CLI
Affects Versions: 0.14.0, 1.0.0, 1.2.0
Environment: Hadoop 2.3.0-cdh5.0.0 Hive 0.14
Reporter: xiaowei wang
Assignee: xiaowei wang
Fix For: 1.2.0
Attachments: HIVE-11095.1.patch.txt
[jira] [Commented] (HIVE-11095) SerDeUtils another bug ,when Text is reused
[ https://issues.apache.org/jira/browse/HIVE-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599430#comment-14599430 ]

xiaowei wang commented on HIVE-11095:

SerDeUtils invokes the wrong method of Text: getBytes()!

SerDeUtils another bug ,when Text is reused
Key: HIVE-11095
URL: https://issues.apache.org/jira/browse/HIVE-11095
Project: Hive
Issue Type: Bug
Components: API, CLI
Affects Versions: 0.14.0, 1.0.0, 1.2.0
Environment: Hadoop 2.3.0-cdh5.0.0 Hive 0.14
Reporter: xiaowei wang
Assignee: xiaowei wang
Priority: Critical
Fix For: 1.2.0
Attachments: HIVE-11095.1.patch.txt

The method transformTextFromUTF8 has a bug: it calls Text.getBytes(), which is the wrong method to use here.

When I query data from an LZO table, I find in the results that the length of the current row is always larger than that of the previous row, and sometimes the current row contains the contents of the previous row. For example, I execute this SQL:

    select * from web_searchhub where logdate=2015061003

The result is shown below. Notice that the second row contains content from the first row.

    INFO [03:00:05.589] HttpFrontServer::FrontSH msgRecv:Remote=/10.13.193.68:42098,session=3151,thread=254 2015061003
    INFO [03:00:05.594] 18941e66-9962-44ad-81bc-3519f47ba274 session=901,thread=223ession=3151,thread=254 2015061003

The content of the original LZO file is shown below, just 2 rows.

    INFO [03:00:05.635] b88e0473-7530-494c-82d8-e2d2ebd2666c_forweb session=3148,thread=285
    INFO [03:00:05.635] HttpFrontServer::FrontSH msgRecv:Remote=/10.13.193.68:42095,session=3148,thread=285

I think this error is caused by Text reuse, and I have found a solution.

Additionally, the table creation SQL is:

    CREATE EXTERNAL TABLE `web_searchhub`(
      `line` string)
    PARTITIONED BY (
      `logdate` string)
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ' U'
    WITH SERDEPROPERTIES (
      'serialization.encoding'='GBK')
    STORED AS INPUTFORMAT com.hadoop.mapred.DeprecatedLzoTextInputFormat
    OUTPUTFORMAT org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat;
    LOCATION 'viewfs://nsX/user/hive/warehouse/raw.db/web/web_searchhub' ;
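
To illustrate the Text reuse problem reported in this issue, here is a minimal, self-contained sketch. It does not use Hadoop or Hive: ReusableText is a hypothetical stand-in that only imitates the behaviour described in the report (getBytes() returns the raw backing array, only the first getLength() bytes are valid, copyBytes() returns an exact-length copy), and decodeBuggy/decodeFixed are illustrative names, not the code from the attached patches. UTF-8 is used instead of GBK purely to keep the example portable.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical stand-in for a reusable text buffer (NOT org.apache.hadoop.io.Text). */
final class ReusableText {
    private byte[] bytes = new byte[0];
    private int length = 0;

    /** Overwrites the contents; the backing array grows when needed but never shrinks. */
    void set(byte[] src) {
        if (src.length > bytes.length) {
            bytes = new byte[src.length];
        }
        System.arraycopy(src, 0, bytes, 0, src.length);
        length = src.length;
    }

    /** Returns the raw backing array; only the first getLength() bytes are valid. */
    byte[] getBytes() {
        return bytes;
    }

    int getLength() {
        return length;
    }

    /** Returns a copy that is exactly getLength() bytes long. */
    byte[] copyBytes() {
        return Arrays.copyOf(bytes, length);
    }
}

final class TextReuseDemo {
    /** Buggy: decodes the whole backing array, including stale bytes left by an earlier row. */
    static String decodeBuggy(ReusableText text, Charset charset) {
        return new String(text.getBytes(), charset);
    }

    /** Fixed: decodes only the valid prefix of the backing array. */
    static String decodeFixed(ReusableText text, Charset charset) {
        return new String(text.getBytes(), 0, text.getLength(), charset);
    }

    public static void main(String[] args) {
        Charset cs = StandardCharsets.UTF_8;
        ReusableText row = new ReusableText();
        row.set("session=3151,thread=254".getBytes(cs)); // longer first row
        row.set("session=901".getBytes(cs));             // shorter second row reuses the buffer
        System.out.println("buggy: " + decodeBuggy(row, cs)); // carries the tail of the first row
        System.out.println("fixed: " + decodeFixed(row, cs)); // only the second row
    }
}
```

Bounding the decode by getLength() avoids an extra array copy; using copyBytes() would give the same result at the cost of one allocation per row.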