[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write

2019-02-26 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778846#comment-16778846
 ] 

BELUGA BEHR edited comment on HIVE-21240 at 2/27/19 3:44 AM:
-

All unit tests are passing [~bslim] [~kgyrtkirk].  Please consider this patch 
for inclusion into the project.  I understand there is some hesitation 
regarding the change in return type.  Previous a native array was returned and 
now (with this patch) a Collection (List) is returned by the SerDe.  I think 
it's better to work with Java Collections instead of native arrays and if we're 
going to change the return value, this is an appropriate time to introduce such 
a change, i.e., in a major (4.0) release.


was (Author: belugabehr):
All unit tests are passing [~bslim] [~kgyrtkirk].  Please consider this patch 
for inclusion into the project.  I understand there is some hesitation 
regarding the change in return type.  Previous a native array was returned and 
now (with this patch) a Collection (List) is returned by the SerDe.  I think 
it's better to work with Java Collections instead of native arrays and if we're 
going to change the return value at all, this is an appropriate time to 
introduce such a change, i.e., in a major (4.0) release.

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.11.patch, 
> HIVE-21240.11.patch, HIVE-21240.11.patch, HIVE-21240.2.patch, 
> HIVE-21240.3.patch, HIVE-21240.4.patch, HIVE-21240.5.patch, 
> HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-21240.9.patch, 
> HIVE-24240.8.patch, kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O\(n\) for 
> each row processed, for each column in the row



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write

2019-02-26 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778846#comment-16778846
 ] 

BELUGA BEHR edited comment on HIVE-21240 at 2/27/19 3:44 AM:
-

All unit tests are passing [~bslim] [~kgyrtkirk].  Please consider this patch 
for inclusion into the project.  I understand there is some hesitation 
regarding the change in return type.  Previous a native array was returned and 
now (with this patch) a Collection (List) is returned by the SerDe.  I think 
it's better to work with Java Collections instead of native arrays and if we're 
going to change the return value at all, this is an appropriate time to 
introduce such a change, i.e., in a major (4.0) release.


was (Author: belugabehr):
All unit tests are passing [~bslim] [~kgyrtkirk].  Please consider this patch 
for inclusion into the project.  I understand there is some hesitation 
regarding the change in return type.  Previous a native array was returned and 
now a Collection (List) is returned by the SerDe.  I think it's better to work 
with Java Collections instead of native arrays and if we're going to change the 
return value at all, this is an appropriate time to introduce such a change, 
i.e., in a major (4.0) release.

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.11.patch, 
> HIVE-21240.11.patch, HIVE-21240.11.patch, HIVE-21240.2.patch, 
> HIVE-21240.3.patch, HIVE-21240.4.patch, HIVE-21240.5.patch, 
> HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-21240.9.patch, 
> HIVE-24240.8.patch, kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O\(n\) for 
> each row processed, for each column in the row



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write

2019-02-25 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777516#comment-16777516
 ] 

BELUGA BEHR edited comment on HIVE-21240 at 2/26/19 2:54 AM:
-

[~bslim]

 

Thanks for the update.

Here is the diff I'm looking at: [^kafka_storage_handler.diff]

To pass the test with this diff, it requires that you use the {{JsonSerDe}} on 
my local branch which fixes the {{timestamp with local timezone}} stuff.  As 
you can see, I have populated the values with the timestamp values.  Are you 
expecting all values to be lost (null)?

 

Regarding {{KafkaJsonSerDe}}, if you wish to keep it around, I recommend we 
move it to the 'test' directory so that it's not shipping with the actual 
product.  If it's not meant for production, we don't want to make it available, 
because there's always that one person that will find it and use it.  However, 
the Hive {{JsonSerde}} is already the default in the Kafka project, so what is 
the LOE to use the one included with Hive than to use this test implementation?


was (Author: belugabehr):
[~bslim]

 

Thanks for the update.

Here is the diff I'm looking at: [^kafka_storage_handler.diff]

To pass the test with this diff, it requires that you use the {{JsonSerDe}} on 
my local branch which fixes the {{timestamp with local timezone}} stuff.  As 
you can see, I have populated the values with the timestamp values.  Are you 
expecting all values to be lost?

 

Regarding {{KafkaJsonSerDe}}, if you wish to keep it around, I recommend we 
move it to the 'test' directory so that it's not shipping with the actual 
product.  If it's not meant for production, we don't want to make it available, 
because there's always that one person that will find it and use it.  However, 
the Hive {{JsonSerde}} is already the default in the Kafka project, so what is 
the LOE to use the one included with Hive than to use this test implementation?

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, 
> HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, 
> HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch, 
> kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O\(n\) for 
> each row processed, for each column in the row



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write

2019-02-25 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777516#comment-16777516
 ] 

BELUGA BEHR edited comment on HIVE-21240 at 2/26/19 2:53 AM:
-

[~bslim]

 

Thanks for the update.

Here is the diff I'm looking at: [^kafka_storage_handler.diff]

To pass the test with this diff, it requires that you use the {{JsonSerDe}} on 
my local branch which fixes the {{timestamp with local timezone}} stuff.  As 
you can see, I have populated the values with the timestamp values?  Are you 
expecting all values to be lost?

 

Regarding {{KafkaJsonSerDe}}, if you wish to keep it around, I recommend we 
move it to the 'test' directory so that it's not shipping with the actual 
product.  If it's not meant for production, we don't want to make it available, 
because there's always that one person that will find it and use it.  However, 
the Hive {{JsonSerde}} is already the default in the Kafka project, so what is 
the LOE to use the one included with Hive than to use this test implementation?


was (Author: belugabehr):
[~bslim]

 

Thanks for the update.

Here is the diff I'm looking at: [^kafka_storage_handler.diff]

To pass the test with this diff, it requires that you use the work for 
{{JsonSerDe}} on my local branch.  As you can see, I have populated the values 
with the timestamp values?  Are you expecting all values to be lost?

 

Regarding {{KafkaJsonSerDe}}, if you wish to keep it around, I recommend we 
move it to the 'test' directory so that it's not shipping with the actual 
product.  If it's not meant for production, we don't want to make it available, 
because there's always that one person that will find it and use it.  However, 
the Hive {{JsonSerde}} is already the default in the Kafka project, so what is 
the LOE to use the one included with Hive than to use this test implementation?

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, 
> HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, 
> HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch, 
> kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O\(n\) for 
> each row processed, for each column in the row



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write

2019-02-25 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777516#comment-16777516
 ] 

BELUGA BEHR edited comment on HIVE-21240 at 2/26/19 2:53 AM:
-

[~bslim]

 

Thanks for the update.

Here is the diff I'm looking at: [^kafka_storage_handler.diff]

To pass the test with this diff, it requires that you use the {{JsonSerDe}} on 
my local branch which fixes the {{timestamp with local timezone}} stuff.  As 
you can see, I have populated the values with the timestamp values.  Are you 
expecting all values to be lost?

 

Regarding {{KafkaJsonSerDe}}, if you wish to keep it around, I recommend we 
move it to the 'test' directory so that it's not shipping with the actual 
product.  If it's not meant for production, we don't want to make it available, 
because there's always that one person that will find it and use it.  However, 
the Hive {{JsonSerde}} is already the default in the Kafka project, so what is 
the LOE to use the one included with Hive than to use this test implementation?


was (Author: belugabehr):
[~bslim]

 

Thanks for the update.

Here is the diff I'm looking at: [^kafka_storage_handler.diff]

To pass the test with this diff, it requires that you use the {{JsonSerDe}} on 
my local branch which fixes the {{timestamp with local timezone}} stuff.  As 
you can see, I have populated the values with the timestamp values?  Are you 
expecting all values to be lost?

 

Regarding {{KafkaJsonSerDe}}, if you wish to keep it around, I recommend we 
move it to the 'test' directory so that it's not shipping with the actual 
product.  If it's not meant for production, we don't want to make it available, 
because there's always that one person that will find it and use it.  However, 
the Hive {{JsonSerde}} is already the default in the Kafka project, so what is 
the LOE to use the one included with Hive than to use this test implementation?

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, 
> HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, 
> HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch, 
> kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O\(n\) for 
> each row processed, for each column in the row



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write

2019-02-25 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777102#comment-16777102
 ] 

BELUGA BEHR edited comment on HIVE-21240 at 2/25/19 5:24 PM:
-

[~kgyrtkirk] Thanks!

#1 I'm not sure I understand the first request.  Are you talking specifically 
about the HCat code?  Are there missing unit tests here?  Is that why it passes 
even though the data types have been changed?  As I see it the native arrays 
are all transformed into Java Collections:

{code:java|title=HCat JsonSerDe}
  List fatRow = fatLand((Object[]) row);
  return new DefaultHCatRecord(fatRow);
...
return Arrays.asList(ArrayUtils.toObject((int[]) arr));
{code}

So, the JSON SerDe should just create Java Collections from the get-go instead 
of having to transform it later.

#2 I noted that the Kafka_Handler Q-Test fails locally on trunk as well.  I 
searched across JIRA and see this test fails across many places. I'm not 
suggesting that there be an exception to the "all green" policy, simply that I 
need help with investigating the cause as I believe it is outside the scope of 
this one JIRA.

#3 I don't think there's much value in going back and changing the code and 
testing it.  These proposed changes are not about making the SerDe faster, I 
just want to put out there that there isn't a huge regression.  If it's a bit 
quicker, than that's an added bonus.


was (Author: belugabehr):
[~kgyrtkirk] Thanks!

#1 I'm not sure I understand the first request.  Are you talking specifically 
about the HCat code?  Are there missing unit tests here?  Is that why it passes 
even though the data types have been changed?  As I see it the native arrays 
are all transformed into Java Collections:

{code:java|title=HCat JsonSerDe}
  List fatRow = fatLand((Object[]) row);
  return new DefaultHCatRecord(fatRow);
...
return Arrays.asList(ArrayUtils.toObject((int[]) arr));
{code}

So, the JSON SerDe should just create Java Collections from the get-go instead 
of having to transform it later.

#2 I noted that the Kafka_Handler Q-Test fails locally on trunk as well.  I 
searched across JIRA and see this test fails across many places. I can keep 
looking at it though.

#3 I don't think there's much value in going back and changing the code and 
testing it.  These proposed changes are not about making the SerDe faster, I 
just want to put out there that there isn't a huge regression.  If it's a bit 
quicker, than that's an added bonus.

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, 
> HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, 
> HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O\(n\) for 
> each row processed, for each column in the row



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write

2019-02-22 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775543#comment-16775543
 ] 

BELUGA BEHR edited comment on HIVE-21240 at 2/22/19 7:41 PM:
-

OK, I figured out the issue.  I am running this SerDe in CDH 6.1 (based on Hive 
2.2) and it fails with a version-mismatch issue when handling dates.

 

This patch contains a JsonSerDe which is faster (read) and more feature rich 
than the existing JsonSerde.  Please accept the latest patch for inclusion into 
the project.


was (Author: belugabehr):
OK, I figured out the issue.  I am running this SerDe in CDH 6.1 and it fails 
with a version-mismatch issue when handling dates.

 

This patch contains a JsonSerDe which is faster (read) and more feature rich 
than the existing JsonSerde.  Please accept the latest patch for inclusion into 
the project.

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, 
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, 
> HIVE-21240.8.patch, HIVE-21240.8.patch, HIVE-24240.8.patch, 
> HIVE-24240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O\(n\) for 
> each row processed, for each column in the row



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)