[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778846#comment-16778846 ] BELUGA BEHR edited comment on HIVE-21240 at 2/27/19 3:44 AM: - All unit tests are passing [~bslim] [~kgyrtkirk]. Please consider this patch for inclusion into the project. I understand there is some hesitation regarding the change in return type. Previous a native array was returned and now (with this patch) a Collection (List) is returned by the SerDe. I think it's better to work with Java Collections instead of native arrays and if we're going to change the return value, this is an appropriate time to introduce such a change, i.e., in a major (4.0) release. was (Author: belugabehr): All unit tests are passing [~bslim] [~kgyrtkirk]. Please consider this patch for inclusion into the project. I understand there is some hesitation regarding the change in return type. Previous a native array was returned and now (with this patch) a Collection (List) is returned by the SerDe. I think it's better to work with Java Collections instead of native arrays and if we're going to change the return value at all, this is an appropriate time to introduce such a change, i.e., in a major (4.0) release. > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.11.patch, > HIVE-21240.11.patch, HIVE-21240.11.patch, HIVE-21240.2.patch, > HIVE-21240.3.patch, HIVE-21240.4.patch, HIVE-21240.5.patch, > HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-21240.9.patch, > HIVE-24240.8.patch, kafka_storage_handler.diff > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778846#comment-16778846 ] BELUGA BEHR edited comment on HIVE-21240 at 2/27/19 3:44 AM: - All unit tests are passing [~bslim] [~kgyrtkirk]. Please consider this patch for inclusion into the project. I understand there is some hesitation regarding the change in return type. Previous a native array was returned and now (with this patch) a Collection (List) is returned by the SerDe. I think it's better to work with Java Collections instead of native arrays and if we're going to change the return value at all, this is an appropriate time to introduce such a change, i.e., in a major (4.0) release. was (Author: belugabehr): All unit tests are passing [~bslim] [~kgyrtkirk]. Please consider this patch for inclusion into the project. I understand there is some hesitation regarding the change in return type. Previous a native array was returned and now a Collection (List) is returned by the SerDe. I think it's better to work with Java Collections instead of native arrays and if we're going to change the return value at all, this is an appropriate time to introduce such a change, i.e., in a major (4.0) release. > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.11.patch, > HIVE-21240.11.patch, HIVE-21240.11.patch, HIVE-21240.2.patch, > HIVE-21240.3.patch, HIVE-21240.4.patch, HIVE-21240.5.patch, > HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-21240.9.patch, > HIVE-24240.8.patch, kafka_storage_handler.diff > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777516#comment-16777516 ] BELUGA BEHR edited comment on HIVE-21240 at 2/26/19 2:54 AM: - [~bslim] Thanks for the update. Here is the diff I'm looking at: [^kafka_storage_handler.diff] To pass the test with this diff, it requires that you use the {{JsonSerDe}} on my local branch which fixes the {{timestamp with local timezone}} stuff. As you can see, I have populated the values with the timestamp values. Are you expecting all values to be lost (null)? Regarding {{KafkaJsonSerDe}}, if you wish to keep it around, I recommend we move it to the 'test' directory so that it's not shipping with the actual product. If it's not meant for production, we don't want to make it available, because there's always that one person that will find it and use it. However, the Hive {{JsonSerde}} is already the default in the Kafka project, so what is the LOE to use the one included with Hive than to use this test implementation? was (Author: belugabehr): [~bslim] Thanks for the update. Here is the diff I'm looking at: [^kafka_storage_handler.diff] To pass the test with this diff, it requires that you use the {{JsonSerDe}} on my local branch which fixes the {{timestamp with local timezone}} stuff. As you can see, I have populated the values with the timestamp values. Are you expecting all values to be lost? Regarding {{KafkaJsonSerDe}}, if you wish to keep it around, I recommend we move it to the 'test' directory so that it's not shipping with the actual product. If it's not meant for production, we don't want to make it available, because there's always that one person that will find it and use it. However, the Hive {{JsonSerde}} is already the default in the Kafka project, so what is the LOE to use the one included with Hive than to use this test implementation? > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, > HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, > HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch, > kafka_storage_handler.diff > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777516#comment-16777516 ] BELUGA BEHR edited comment on HIVE-21240 at 2/26/19 2:53 AM: - [~bslim] Thanks for the update. Here is the diff I'm looking at: [^kafka_storage_handler.diff] To pass the test with this diff, it requires that you use the {{JsonSerDe}} on my local branch which fixes the {{timestamp with local timezone}} stuff. As you can see, I have populated the values with the timestamp values? Are you expecting all values to be lost? Regarding {{KafkaJsonSerDe}}, if you wish to keep it around, I recommend we move it to the 'test' directory so that it's not shipping with the actual product. If it's not meant for production, we don't want to make it available, because there's always that one person that will find it and use it. However, the Hive {{JsonSerde}} is already the default in the Kafka project, so what is the LOE to use the one included with Hive than to use this test implementation? was (Author: belugabehr): [~bslim] Thanks for the update. Here is the diff I'm looking at: [^kafka_storage_handler.diff] To pass the test with this diff, it requires that you use the work for {{JsonSerDe}} on my local branch. As you can see, I have populated the values with the timestamp values? Are you expecting all values to be lost? Regarding {{KafkaJsonSerDe}}, if you wish to keep it around, I recommend we move it to the 'test' directory so that it's not shipping with the actual product. If it's not meant for production, we don't want to make it available, because there's always that one person that will find it and use it. However, the Hive {{JsonSerde}} is already the default in the Kafka project, so what is the LOE to use the one included with Hive than to use this test implementation? > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, > HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, > HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch, > kafka_storage_handler.diff > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777516#comment-16777516 ] BELUGA BEHR edited comment on HIVE-21240 at 2/26/19 2:53 AM: - [~bslim] Thanks for the update. Here is the diff I'm looking at: [^kafka_storage_handler.diff] To pass the test with this diff, it requires that you use the {{JsonSerDe}} on my local branch which fixes the {{timestamp with local timezone}} stuff. As you can see, I have populated the values with the timestamp values. Are you expecting all values to be lost? Regarding {{KafkaJsonSerDe}}, if you wish to keep it around, I recommend we move it to the 'test' directory so that it's not shipping with the actual product. If it's not meant for production, we don't want to make it available, because there's always that one person that will find it and use it. However, the Hive {{JsonSerde}} is already the default in the Kafka project, so what is the LOE to use the one included with Hive than to use this test implementation? was (Author: belugabehr): [~bslim] Thanks for the update. Here is the diff I'm looking at: [^kafka_storage_handler.diff] To pass the test with this diff, it requires that you use the {{JsonSerDe}} on my local branch which fixes the {{timestamp with local timezone}} stuff. As you can see, I have populated the values with the timestamp values? Are you expecting all values to be lost? Regarding {{KafkaJsonSerDe}}, if you wish to keep it around, I recommend we move it to the 'test' directory so that it's not shipping with the actual product. If it's not meant for production, we don't want to make it available, because there's always that one person that will find it and use it. However, the Hive {{JsonSerde}} is already the default in the Kafka project, so what is the LOE to use the one included with Hive than to use this test implementation? > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, > HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, > HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch, > kafka_storage_handler.diff > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777102#comment-16777102 ] BELUGA BEHR edited comment on HIVE-21240 at 2/25/19 5:24 PM: - [~kgyrtkirk] Thanks! #1 I'm not sure I understand the first request. Are you talking specifically about the HCat code? Are there missing unit tests here? Is that why it passes even though the data types have been changed? As I see it the native arrays are all transformed into Java Collections: {code:java|title=HCat JsonSerDe} List fatRow = fatLand((Object[]) row); return new DefaultHCatRecord(fatRow); ... return Arrays.asList(ArrayUtils.toObject((int[]) arr)); {code} So, the JSON SerDe should just create Java Collections from the get-go instead of having to transform it later. #2 I noted that the Kafka_Handler Q-Test fails locally on trunk as well. I searched across JIRA and see this test fails across many places. I'm not suggesting that there be an exception to the "all green" policy, simply that I need help with investigating the cause as I believe it is outside the scope of this one JIRA. #3 I don't think there's much value in going back and changing the code and testing it. These proposed changes are not about making the SerDe faster, I just want to put out there that there isn't a huge regression. If it's a bit quicker, than that's an added bonus. was (Author: belugabehr): [~kgyrtkirk] Thanks! #1 I'm not sure I understand the first request. Are you talking specifically about the HCat code? Are there missing unit tests here? Is that why it passes even though the data types have been changed? As I see it the native arrays are all transformed into Java Collections: {code:java|title=HCat JsonSerDe} List fatRow = fatLand((Object[]) row); return new DefaultHCatRecord(fatRow); ... return Arrays.asList(ArrayUtils.toObject((int[]) arr)); {code} So, the JSON SerDe should just create Java Collections from the get-go instead of having to transform it later. #2 I noted that the Kafka_Handler Q-Test fails locally on trunk as well. I searched across JIRA and see this test fails across many places. I can keep looking at it though. #3 I don't think there's much value in going back and changing the code and testing it. These proposed changes are not about making the SerDe faster, I just want to put out there that there isn't a huge regression. If it's a bit quicker, than that's an added bonus. > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, > HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, > HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775543#comment-16775543 ] BELUGA BEHR edited comment on HIVE-21240 at 2/22/19 7:41 PM: - OK, I figured out the issue. I am running this SerDe in CDH 6.1 (based on Hive 2.2) and it fails with a version-mismatch issue when handling dates. This patch contains a JsonSerDe which is faster (read) and more feature rich than the existing JsonSerde. Please accept the latest patch for inclusion into the project. was (Author: belugabehr): OK, I figured out the issue. I am running this SerDe in CDH 6.1 and it fails with a version-mismatch issue when handling dates. This patch contains a JsonSerDe which is faster (read) and more feature rich than the existing JsonSerde. Please accept the latest patch for inclusion into the project. > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, > HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, > HIVE-21240.8.patch, HIVE-21240.8.patch, HIVE-24240.8.patch, > HIVE-24240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)