[jira] [Updated] (HIVE-19844) Make CSV SerDe First-Class SerDe

2019-03-01 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-19844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-19844:
---
Description: 
According to the [Hive SerDe 
Docs|https://cwiki.apache.org/confluence/display/Hive/CSV+Serde], there are 
some extra steps involved in getting the CSV SerDe working with Hive.

{code}
CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   "separatorChar" = "\t",
   "quoteChar" = "'",
   "escapeChar"= "\\"
)  
STORED AS TEXTFILE;
{code}

I would like to propose that we move this SerDe into first-class status:

{{STORED AS CSVFILE}}
{{STORED AS TSVFILE}}

The user should have to perform no additional steps to use this SerDe.

  was:
According to the [Hive SerDe 
Docs|https://cwiki.apache.org/confluence/display/Hive/CSV+Serde], there are 
some extra steps involved in getting the CSV SerDe working with Hive.

{code}
CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   "separatorChar" = "\t",
   "quoteChar" = "'",
   "escapeChar"= "\\"
)  
STORED AS TEXTFILE;
{code}

I would like to propose that we move this SerDe into first-class status:

{{STORED AS TEXT_CSV}}
{{STORED AS TEXT_TSV}}

The user should have to perform no additional steps to use this SerDe.


> Make CSV SerDe First-Class SerDe
> 
>
> Key: HIVE-19844
> URL: https://issues.apache.org/jira/browse/HIVE-19844
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2, Serializers/Deserializers
>Affects Versions: 3.0.0, 2.3.2, 4.0.0
>Reporter: BELUGA BEHR
>Priority: Major
>
> According to the [Hive SerDe 
> Docs|https://cwiki.apache.org/confluence/display/Hive/CSV+Serde], there are 
> some extra steps involved in getting the CSV SerDe working with Hive.
> {code}
> CREATE TABLE my_table(a string, b string, ...)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> WITH SERDEPROPERTIES (
>    "separatorChar" = "\t",
>    "quoteChar" = "'",
>    "escapeChar"= "\\"
> )  
> STORED AS TEXTFILE;
> {code}
> I would like to propose that we move this SerDe into first-class status:
> {{STORED AS CSVFILE}}
> {{STORED AS TSVFILE}}
> The user should have to perform no additional steps to use this SerDe.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-19844) Make CSV SerDe First-Class SerDe

2019-03-01 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-19844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR reassigned HIVE-19844:
--

Assignee: (was: anand)

> Make CSV SerDe First-Class SerDe
> 
>
> Key: HIVE-19844
> URL: https://issues.apache.org/jira/browse/HIVE-19844
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2, Serializers/Deserializers
>Affects Versions: 3.0.0, 2.3.2, 4.0.0
>Reporter: BELUGA BEHR
>Priority: Major
>
> According to the [Hive SerDe 
> Docs|https://cwiki.apache.org/confluence/display/Hive/CSV+Serde], there are 
> some extra steps involved in getting the CSV SerDe working with Hive.
> {code}
> CREATE TABLE my_table(a string, b string, ...)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> WITH SERDEPROPERTIES (
>    "separatorChar" = "\t",
>    "quoteChar" = "'",
>    "escapeChar"= "\\"
> )  
> STORED AS TEXTFILE;
> {code}
> I would like to propose that we move this SerDe into first-class status:
> {{STORED AS TEXT_CSV}}
> {{STORED AS TEXT_TSV}}
> The user should have to perform no additional steps to use this SerDe.





[jira] [Updated] (HIVE-21372) Use Apache Commons IO To Read Stream To String

2019-03-01 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21372:
---
Attachment: HIVE-21372.1.patch

> Use Apache Commons IO To Read Stream To String
> --
>
> Key: HIVE-21372
> URL: https://issues.apache.org/jira/browse/HIVE-21372
> Project: Hive
>  Issue Type: Improvement
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Fix For: 4.0.0
>
> Attachments: HIVE-21372.1.patch
>
>






[jira] [Assigned] (HIVE-21372) Use Apache Commons IO To Read Stream To String

2019-03-01 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR reassigned HIVE-21372:
--

Assignee: BELUGA BEHR

> Use Apache Commons IO To Read Stream To String
> --
>
> Key: HIVE-21372
> URL: https://issues.apache.org/jira/browse/HIVE-21372
> Project: Hive
>  Issue Type: Improvement
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Fix For: 4.0.0
>
> Attachments: HIVE-21372.1.patch
>
>






[jira] [Updated] (HIVE-21372) Use Apache Commons IO To Read Stream To String

2019-03-01 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21372:
---
Status: Patch Available  (was: Open)

> Use Apache Commons IO To Read Stream To String
> --
>
> Key: HIVE-21372
> URL: https://issues.apache.org/jira/browse/HIVE-21372
> Project: Hive
>  Issue Type: Improvement
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Fix For: 4.0.0
>
> Attachments: HIVE-21372.1.patch
>
>






[jira] [Assigned] (HIVE-21371) Make NonSyncByteArrayOutputStream Overflow Conscious

2019-03-01 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR reassigned HIVE-21371:
--

Assignee: BELUGA BEHR

> Make NonSyncByteArrayOutputStream Overflow Conscious 
> -
>
> Key: HIVE-21371
> URL: https://issues.apache.org/jira/browse/HIVE-21371
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 4.0.0, 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
> Attachments: HIVE-21371.1.patch
>
>
> {code:java|title=NonSyncByteArrayOutputStream}
>   private int enLargeBuffer(int increment) {
>     int temp = count + increment;
>     int newLen = temp;
>     if (temp > buf.length) {
>       if ((buf.length << 1) > temp) {
>         newLen = buf.length << 1;
>       }
>       byte newbuf[] = new byte[newLen];
>       System.arraycopy(buf, 0, newbuf, 0, count);
>       buf = newbuf;
>     }
>     return newLen;
>   }
> {code}
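> This misbehaves once the buffer reaches 1GB: {{buf.length << 1}} overflows to a 
> negative value, so the buffer stops doubling, and {{count + increment}} can 
> itself overflow, with no consideration for the roughly 2GB 
> ({{Integer.MAX_VALUE}} elements) limit on Java arrays.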
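
A minimal sketch of an overflow-conscious growth calculation follows. It is not
the attached patch: the class name BufferGrowth, the method newCapacity, and the
MAX_ARRAY_SIZE headroom of 8 elements are assumptions made for illustration. The
required size and the doubled size are computed in long arithmetic so neither can
wrap to a negative int, and the result is clamped to the largest array length the
sketch assumes a JVM will allocate.

```java
final class BufferGrowth {
  // Assumption: leave a little headroom below Integer.MAX_VALUE, since some
  // JVMs cannot allocate arrays of exactly Integer.MAX_VALUE elements.
  static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

  /**
   * Returns the capacity a buffer of currentLength should have so that
   * count + increment bytes fit. Hypothetical helper, not Hive's API.
   */
  static int newCapacity(int currentLength, int count, int increment) {
    if (currentLength < 0 || count < 0 || increment < 0) {
      throw new IllegalArgumentException("sizes must be non-negative");
    }
    // long arithmetic: the sum of two non-negative ints always fits in a long.
    long required = (long) count + increment;
    if (required > MAX_ARRAY_SIZE) {
      throw new OutOfMemoryError("Required buffer size too large: " + required);
    }
    if (required <= currentLength) {
      return currentLength; // nothing to grow
    }
    // Doubling is also done in long so it cannot overflow to a negative value.
    long doubled = (long) currentLength << 1;
    long candidate = Math.max(doubled, required);
    // Clamp to the array limit instead of letting the doubling run past it.
    return (int) Math.min(candidate, MAX_ARRAY_SIZE);
  }

  public static void main(String[] args) {
    // A 1GB buffer that needs one more byte: doubling is clamped, not negative.
    System.out.println(newCapacity(1 << 30, 1 << 30, 1));
  }
}
```

The caller would allocate new byte[newCapacity(buf.length, count, increment)] and
copy the first count bytes, as the existing method already does.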





[jira] [Updated] (HIVE-21371) Make NonSyncByteArrayOutputStream Overflow Conscious

2019-03-01 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21371:
---
Status: Patch Available  (was: Open)

> Make NonSyncByteArrayOutputStream Overflow Conscious 
> -
>
> Key: HIVE-21371
> URL: https://issues.apache.org/jira/browse/HIVE-21371
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 4.0.0, 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
> Attachments: HIVE-21371.1.patch
>
>
> {code:java|title=NonSyncByteArrayOutputStream}
>   private int enLargeBuffer(int increment) {
>     int temp = count + increment;
>     int newLen = temp;
>     if (temp > buf.length) {
>       if ((buf.length << 1) > temp) {
>         newLen = buf.length << 1;
>       }
>       byte newbuf[] = new byte[newLen];
>       System.arraycopy(buf, 0, newbuf, 0, count);
>       buf = newbuf;
>     }
>     return newLen;
>   }
> {code}
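> This misbehaves once the buffer reaches 1GB: {{buf.length << 1}} overflows to a 
> negative value, so the buffer stops doubling, and {{count + increment}} can 
> itself overflow, with no consideration for the roughly 2GB 
> ({{Integer.MAX_VALUE}} elements) limit on Java arrays.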





[jira] [Updated] (HIVE-21371) Make NonSyncByteArrayOutputStream Overflow Conscious

2019-03-01 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21371:
---
Attachment: HIVE-21371.1.patch

> Make NonSyncByteArrayOutputStream Overflow Conscious 
> -
>
> Key: HIVE-21371
> URL: https://issues.apache.org/jira/browse/HIVE-21371
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 4.0.0, 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
> Attachments: HIVE-21371.1.patch
>
>
> {code:java|title=NonSyncByteArrayOutputStream}
>   private int enLargeBuffer(int increment) {
>     int temp = count + increment;
>     int newLen = temp;
>     if (temp > buf.length) {
>       if ((buf.length << 1) > temp) {
>         newLen = buf.length << 1;
>       }
>       byte newbuf[] = new byte[newLen];
>       System.arraycopy(buf, 0, newbuf, 0, count);
>       buf = newbuf;
>     }
>     return newLen;
>   }
> {code}
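> This misbehaves once the buffer reaches 1GB: {{buf.length << 1}} overflows to a 
> negative value, so the buffer stops doubling, and {{count + increment}} can 
> itself overflow, with no consideration for the roughly 2GB 
> ({{Integer.MAX_VALUE}} elements) limit on Java arrays.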





[jira] [Updated] (HIVE-21370) JsonSerDe cannot handle json file with empty lines - Branch 3

2019-03-01 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21370:
---
Description: Empty lines lead to duplicated records on output.

> JsonSerDe cannot handle json file with empty lines - Branch 3
> -
>
> Key: HIVE-21370
> URL: https://issues.apache.org/jira/browse/HIVE-21370
> Project: Hive
>  Issue Type: Bug
>  Components: Serializers/Deserializers
>Affects Versions: 3.0.0, 3.1.0, 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Critical
> Fix For: 3.2.0
>
> Attachments: HIVE-21370.1.patch
>
>
> Empty lines lead to duplicated records on output.





[jira] [Commented] (HIVE-21370) JsonSerDe cannot handle json file with empty lines - Branch 3

2019-03-01 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16781992#comment-16781992
 ] 

BELUGA BEHR commented on HIVE-21370:


I'm not sure how to get YETUS to run this patch against the 3.x branch.  Any 
assistance would be appreciated.

> JsonSerDe cannot handle json file with empty lines - Branch 3
> -
>
> Key: HIVE-21370
> URL: https://issues.apache.org/jira/browse/HIVE-21370
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 3.0.0, 3.1.0, 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: HIVE-21370.1.patch
>
>






[jira] [Updated] (HIVE-21370) JsonSerDe cannot handle json file with empty lines - Branch 3

2019-03-01 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21370:
---
Issue Type: Bug  (was: Improvement)

> JsonSerDe cannot handle json file with empty lines - Branch 3
> -
>
> Key: HIVE-21370
> URL: https://issues.apache.org/jira/browse/HIVE-21370
> Project: Hive
>  Issue Type: Bug
>  Components: Serializers/Deserializers
>Affects Versions: 3.0.0, 3.1.0, 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Critical
> Fix For: 3.2.0
>
> Attachments: HIVE-21370.1.patch
>
>






[jira] [Updated] (HIVE-21370) JsonSerDe cannot handle json file with empty lines - Branch 3

2019-03-01 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21370:
---
Priority: Critical  (was: Major)

> JsonSerDe cannot handle json file with empty lines - Branch 3
> -
>
> Key: HIVE-21370
> URL: https://issues.apache.org/jira/browse/HIVE-21370
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 3.0.0, 3.1.0, 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Critical
> Fix For: 3.2.0
>
> Attachments: HIVE-21370.1.patch
>
>






[jira] [Updated] (HIVE-21370) JsonSerDe cannot handle json file with empty lines - Branch 3

2019-03-01 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21370:
---
Status: Patch Available  (was: Open)

> JsonSerDe cannot handle json file with empty lines - Branch 3
> -
>
> Key: HIVE-21370
> URL: https://issues.apache.org/jira/browse/HIVE-21370
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 3.1.0, 3.0.0, 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: HIVE-21370.1.patch
>
>






[jira] [Commented] (HIVE-21370) JsonSerDe cannot handle json file with empty lines - Branch 3

2019-03-01 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16781984#comment-16781984
 ] 

BELUGA BEHR commented on HIVE-21370:


This issue is already fixed in 4.0 with other changes.  Just fixing the 3.x 
branch with this patch.

> JsonSerDe cannot handle json file with empty lines - Branch 3
> -
>
> Key: HIVE-21370
> URL: https://issues.apache.org/jira/browse/HIVE-21370
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 3.0.0, 3.1.0, 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: HIVE-21370.1.patch
>
>






[jira] [Updated] (HIVE-21370) JsonSerDe cannot handle json file with empty lines - Branch 3

2019-03-01 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21370:
---
Attachment: HIVE-21370.1.patch

> JsonSerDe cannot handle json file with empty lines - Branch 3
> -
>
> Key: HIVE-21370
> URL: https://issues.apache.org/jira/browse/HIVE-21370
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 3.0.0, 3.1.0, 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: HIVE-21370.1.patch
>
>






[jira] [Assigned] (HIVE-21370) JsonSerDe cannot handle json file with empty lines - Branch 3

2019-03-01 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR reassigned HIVE-21370:
--

Assignee: BELUGA BEHR

> JsonSerDe cannot handle json file with empty lines - Branch 3
> -
>
> Key: HIVE-21370
> URL: https://issues.apache.org/jira/browse/HIVE-21370
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 3.0.0, 3.1.0, 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: HIVE-21370.1.patch
>
>






[jira] [Updated] (HIVE-21246) Un-bury DelimitedJSONSerDe from PlanUtils.java

2019-03-01 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21246:
---
Attachment: (was: HIVE-21246.1.patch)

> Un-bury DelimitedJSONSerDe from PlanUtils.java
> --
>
> Key: HIVE-21246
> URL: https://issues.apache.org/jira/browse/HIVE-21246
> Project: Hive
>  Issue Type: Improvement
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: HIVE-21246.1.patch, HIVE-21246.1.patch
>
>
> Ultimately, I'd like to get rid of 
> {{org.apache.hadoop.hive.serde2.DelimitedJSONSerDe}}, but for now, trying to 
> make it easier to get rid of later.  It's currently buried in 
> {{PlanUtils.java}}.
> A SerDe and a boolean flag get passed into these methods.  If the flag is 
> set to true, the specified SerDe is overwritten and assigned to 
> {{DelimitedJSONSerDe}}.  This is not documented anywhere and it's a weird 
> thing to do.  Instead, just pass in the required SerDe from the start rather 
> than sending the wrong SerDe and a flag to overwrite it.





[jira] [Commented] (HIVE-21354) Lock The Entire Table If Majority Of Partitions Are Locked

2019-02-28 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16781159#comment-16781159
 ] 

BELUGA BEHR commented on HIVE-21354:


Thanks for the input [~gopalv].

What about just a simple {{SELECT * FROM TABLE WHERE (non-partitioned-value)=?}}

> Lock The Entire Table If Majority Of Partitions Are Locked
> --
>
> Key: HIVE-21354
> URL: https://issues.apache.org/jira/browse/HIVE-21354
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Affects Versions: 4.0.0, 3.2.0
>Reporter: BELUGA BEHR
>Priority: Major
>
> One of the bottlenecks of any Hive query is the ZooKeeper locking mechanism.  
> When a Hive query interacts with a table which has a lot of partitions, this 
> may put a lot of stress on the ZK system.
> Please add a heuristic that works like this:
> # Count the number of partitions that a query is required to lock
> # Obtain the total number of partitions in the table
> # If the number of partitions accessed by the query is greater than or equal 
> to half the total number of partitions, simply create one ZNode lock at the 
> table level.
> This would improve performance of many queries, but in particular, a {{select 
> count(1) from table}} ... or ... {{select * from table limit 5}} where the 
> table has many partitions.
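
The heuristic in the description could be sketched as below. The class, enum,
and method names are hypothetical and do not correspond to Hive's lock manager
API; the sketch covers only the decision itself, not the creation of ZNodes.
Treating a table with zero partitions as a table-level lock is also an
assumption.

```java
final class LockGranularity {
  enum Level { TABLE, PARTITION }

  /**
   * Decides whether a query should take one table-level lock or one lock per
   * partition. Hypothetical helper illustrating the proposed heuristic.
   */
  static Level choose(long partitionsToLock, long totalPartitions) {
    if (partitionsToLock < 0 || totalPartitions < 0
        || partitionsToLock > totalPartitions) {
      throw new IllegalArgumentException("invalid partition counts");
    }
    if (totalPartitions == 0) {
      // Assumption: with no partitions, the table is the only thing to lock.
      return Level.TABLE;
    }
    // "Greater than or equal to half", written without division or
    // multiplication so odd totals are not rounded and nothing can overflow.
    long untouched = totalPartitions - partitionsToLock;
    return partitionsToLock >= untouched ? Level.TABLE : Level.PARTITION;
  }

  public static void main(String[] args) {
    System.out.println(choose(5, 10));
    System.out.println(choose(4, 10));
  }
}
```

A full-table scan such as the count query mentioned above touches every
partition, so under this sketch it would always resolve to a single table lock.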





[jira] [Commented] (HIVE-21354) Lock The Entire Table If Majority Of Partitions Are Locked

2019-02-28 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780821#comment-16780821
 ] 

BELUGA BEHR commented on HIVE-21354:


bq. This is only true when you disable ACID

So does this only apply to ORC tables?  Does ACIDv2 apply to Parquet, Avro, 
JSON, etc?

> Lock The Entire Table If Majority Of Partitions Are Locked
> --
>
> Key: HIVE-21354
> URL: https://issues.apache.org/jira/browse/HIVE-21354
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Affects Versions: 4.0.0, 3.2.0
>Reporter: BELUGA BEHR
>Priority: Major
>
> One of the bottlenecks of any Hive query is the ZooKeeper locking mechanism.  
> When a Hive query interacts with a table which has a lot of partitions, this 
> may put a lot of stress on the ZK system.
> Please add a heuristic that works like this:
> # Count the number of partitions that a query is required to lock
> # Obtain the total number of partitions in the table
> # If the number of partitions accessed by the query is greater than or equal 
> to half the total number of partitions, simply create one ZNode lock at the 
> table level.
> This would improve performance of many queries, but in particular, a {{select 
> count(1) from table}} ... or ... {{select * from table limit 5}} where the 
> table has many partitions.





[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write

2019-02-28 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780694#comment-16780694
 ] 

BELUGA BEHR commented on HIVE-21240:


Hey Team,

Any other comments, questions, concerns?

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.11.patch, 
> HIVE-21240.11.patch, HIVE-21240.11.patch, HIVE-21240.2.patch, 
> HIVE-21240.3.patch, HIVE-21240.4.patch, HIVE-21240.5.patch, 
> HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-21240.9.patch, 
> HIVE-24240.8.patch, kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O\(n\) for 
> each row processed, for each column in the row
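
A sketch of the column-name to column-index cache mentioned in the last bullet.
The class name ColumnIndexCache is hypothetical and is not the class used in the
patch, and case-insensitive matching of column names is an assumption of this
sketch. The map is built once from the table's column list, so each per-row
lookup becomes a hash lookup instead of a linear scan.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class ColumnIndexCache {
  private final Map<String, Integer> indexByName;

  /** Builds the name-to-index map once, up front. */
  ColumnIndexCache(List<String> columnNames) {
    Map<String, Integer> map = new HashMap<>();
    for (int i = 0; i < columnNames.size(); i++) {
      // Assumption: names are matched case-insensitively; the first
      // occurrence of a name wins if the list contains duplicates.
      map.putIfAbsent(columnNames.get(i).toLowerCase(Locale.ROOT), i);
    }
    this.indexByName = map;
  }

  /** Returns the column index for a JSON field name, or -1 if unknown. */
  int indexOf(String fieldName) {
    Integer index = indexByName.get(fieldName.toLowerCase(Locale.ROOT));
    return index == null ? -1 : index;
  }

  public static void main(String[] args) {
    ColumnIndexCache cache =
        new ColumnIndexCache(java.util.Arrays.asList("id", "name", "ts"));
    System.out.println(cache.indexOf("name"));
  }
}
```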





[jira] [Assigned] (HIVE-21356) Upgrade Jackson to 2.9.8

2019-02-28 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR reassigned HIVE-21356:
--

Assignee: BELUGA BEHR

> Upgrade Jackson to 2.9.8
> 
>
> Key: HIVE-21356
> URL: https://issues.apache.org/jira/browse/HIVE-21356
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Fix For: 4.0.0
>
>
> Currently at:
> {code}
> 2.9.5
> {code}
> Upgrade to 2.9.8 - contains some improvements for processing Base64 data.





[jira] [Updated] (HIVE-21352) Drop INDEX from 3.0 Schema

2019-02-28 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21352:
---
Description: 
We dropped support for Hive indexes starting in 3.0, however there are still 
tables in Metastore to support it.  Please remove.

https://github.com/apache/hive/blob/master/metastore/scripts/upgrade/mysql/hive-schema-2.3.0.mysql.sql#L147-L165

  was:We dropped support for Hive indexes starting in 3.0, however there are 
still tables in Metastore to support it.  Please remove.


> Drop INDEX from 3.0 Schema
> --
>
> Key: HIVE-21352
> URL: https://issues.apache.org/jira/browse/HIVE-21352
> Project: Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 4.0.0, 3.2.0
>Reporter: BELUGA BEHR
>Priority: Minor
>
> We dropped support for Hive indexes starting in 3.0, however there are still 
> tables in Metastore to support it.  Please remove.
> https://github.com/apache/hive/blob/master/metastore/scripts/upgrade/mysql/hive-schema-2.3.0.mysql.sql#L147-L165





[jira] [Updated] (HIVE-21347) Store Partition Count in TBLS

2019-02-28 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21347:
---
Description: Please store a count of the number of partitions each table 
has in the {{TBLS}} table.  This will allow very quick lookups for tables with 
many partitions.  (was: Please store a count of the number of partitions each 
table has in the ```TBLS``` table.  This will allow very quick lookups for 
tables with many partitions.)

> Store Partition Count in TBLS
> -
>
> Key: HIVE-21347
> URL: https://issues.apache.org/jira/browse/HIVE-21347
> Project: Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 4.0.0, 3.2.0
>Reporter: BELUGA BEHR
>Priority: Major
>
> Please store a count of the number of partitions each table has in the 
> {{TBLS}} table.  This will allow very quick lookups for tables with many 
> partitions.





[jira] [Commented] (HIVE-21210) CombineHiveInputFormat Thread Pool Sizing

2019-02-27 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16779954#comment-16779954
 ] 

BELUGA BEHR commented on HIVE-21210:


[~zchovan] Can I get your thoughts on this one? :)

> CombineHiveInputFormat Thread Pool Sizing
> -
>
> Key: HIVE-21210
> URL: https://issues.apache.org/jira/browse/HIVE-21210
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 4.0.0, 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
> Attachments: HIVE-21210.1.patch, HIVE-21210.2.patch, 
> HIVE-21210.3.patch, HIVE-21210.4.patch, HIVE-21210.5.patch, 
> HIVE-21210.6.patch, HIVE-21210.7.patch, HIVE-21210.8.patch
>
>
> Threadpools.
> Hive uses threadpools in several different places and each implementation is 
> a little different and requires different configurations. I think that Hive 
> needs to rein in and standardize the way that threadpools are used and 
> threadpools should scale automatically without manual configuration. At any 
> given time, there are many hundreds of threads running in the HS2 as the 
> number of simultaneous connections increases and they surely cause contention 
> with one-another.
> Here is an example:
> {code:java|title=CombineHiveInputFormat.java}
>   // max number of threads we can use to check non-combinable paths
>   private static final int MAX_CHECK_NONCOMBINABLE_THREAD_NUM = 50;
>   private static final int DEFAULT_NUM_PATH_PER_THREAD = 100;
> {code}
> When building the splits for a MR job, there are up to 50 threads running per 
> query and there is not much scaling here; it's simply a 1 thread : 100 files 
> ratio.  This implies that processing 5000 files takes 50 threads, and beyond 
> that, 50 threads are still used. Many Hive jobs these days involve more than 
> 5000 files, so it's not scaling well at bigger sizes.
> This is not configurable (even manually), it doesn't change when the hardware 
> specs increase, and 50 threads seems like a lot when a service must support 
> up to 80 connections:
> [https://www.cloudera.com/documentation/enterprise/5/latest/topics/admin_hive_tuning.html]
> Not to mention, I have never seen a scenario where HS2 is running on a host 
> all by itself and has the entire system dedicated to it. Therefore it should 
> be more friendly and spin up fewer threads.
> I am attaching a patch here that provides a few features:
>  * Common module that produces {{ExecutorService}} which caps the number of 
> threads it spins up at the number of processors a host has. Keep in mind that 
> a class may submit as many work units ({{Callables}}) as it would like, but 
> the number of threads in the pool is capped.
>  * Common module for partitioning work. That is, allow for a generic 
> framework for dividing work into partitions (i.e. batches)
>  * Modify {{CombineHiveInputFormat}} to take advantage of both modules, 
> performing the same duties in a more Java OO way than is currently implemented
>  * Add a partitioning (batching) implementation that enforces partitioning of 
> a {{Collection}} based on the natural log of the {{Collection}} size so that 
> it scales more slowly than a simple 1:100 ratio.
>  * Simplify unit test code for {{CombineHiveInputFormat}}
> My hope is to introduce these tools to {{CombineHiveInputFormat}} and then to 
> drop it into other places.  One of the things I will introduce here is a 
> "direct thread" {{ExecutorService}} so that even if there is a configuration 
> for a thread pool to be disabled, it will still use an {{ExecutorService}} so 
> that the project can avoid logic like "if this function is serviced by a 
> thread pool, use an {{ExecutorService}} (and remember to close it later!) 
> otherwise, create a single thread" so that things like [HIVE-16949] can be 
> avoided in the future.  Everything will just use an {{ExecutorService}}.
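
Two of the ideas above, capping the pool at the host's processor count and
batching by the natural log of the collection size, can be sketched as follows.
This is not the attached patch: the class WorkPartitioning and its methods are
hypothetical, and the exact formula (batch count = ceiling of ln(size), at least
1) is an assumption chosen only to show growth that is slower than a fixed
1:100 ratio.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class WorkPartitioning {
  /** Never more threads than processors, and never fewer than one. */
  static int cappedPoolSize(int requested, int availableProcessors) {
    return Math.max(1, Math.min(requested, availableProcessors));
  }

  /** A fixed pool whose size is capped at the host's processor count. */
  static ExecutorService newCappedPool(int requested) {
    int processors = Runtime.getRuntime().availableProcessors();
    return Executors.newFixedThreadPool(cappedPoolSize(requested, processors));
  }

  /** Assumed formula: ceil(ln(size)) batches, with a minimum of one. */
  static int batchCount(int size) {
    if (size <= 0) {
      return 0;
    }
    return Math.max(1, (int) Math.ceil(Math.log(size)));
  }

  /** Splits items into at most batchCount(items.size()) contiguous batches. */
  static <T> List<List<T>> partition(List<T> items) {
    int batches = batchCount(items.size());
    List<List<T>> result = new ArrayList<>(batches);
    if (batches == 0) {
      return result;
    }
    int batchSize = (items.size() + batches - 1) / batches; // ceiling division
    for (int start = 0; start < items.size(); start += batchSize) {
      int end = Math.min(start + batchSize, items.size());
      result.add(new ArrayList<>(items.subList(start, end)));
    }
    return result;
  }

  public static void main(String[] args) {
    // Under the assumed formula, 5000 paths need far fewer than 50 batches.
    System.out.println(batchCount(5000));
  }
}
```

Each batch would be submitted as one Callable, so the number of work units grows
with the input while the number of threads stays bounded by the pool.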





[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write

2019-02-27 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16779501#comment-16779501
 ] 

BELUGA BEHR commented on HIVE-21240:


Just to be clear... Avro already uses a {{List}} as the return type, so I'm 
just bringing JsonSerde into alignment with the rest.

https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java#L143

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.11.patch, 
> HIVE-21240.11.patch, HIVE-21240.11.patch, HIVE-21240.2.patch, 
> HIVE-21240.3.patch, HIVE-21240.4.patch, HIVE-21240.5.patch, 
> HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-21240.9.patch, 
> HIVE-24240.8.patch, kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O\(n\) for 
> each row processed, for each column in the row





[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write

2019-02-27 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16779499#comment-16779499
 ] 

BELUGA BEHR commented on HIVE-21240:


[~bslim]  With a large project like Hive, maintained by many different 
supporters and countless additional troubleshooters who dig through 
the code to resolve issues, it is all the more important to adhere to best 
practices.  With few exceptions, everything should be a Java Collection.  
Making smart choices about the actual data structures used (Set, Map, List, 
etc.) is going to yield much more benefit than trying to manipulate primitive 
arrays.  I've never had a Hive user complain that they wished it was 2% faster, 
but I hear all the time about how complicated the product is and how difficult 
it is to troubleshoot.

There are a few books written on the topic which I won't regurgitate here, but 
I think this sums it up well:

https://stackoverflow.com/questions/6100148/collection-interface-vs-arrays



> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.11.patch, 
> HIVE-21240.11.patch, HIVE-21240.11.patch, HIVE-21240.2.patch, 
> HIVE-21240.3.patch, HIVE-21240.4.patch, HIVE-21240.5.patch, 
> HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-21240.9.patch, 
> HIVE-24240.8.patch, kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row
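
A minimal sketch of the base-64 bullet in the description above, assuming the standard RFC 4648 alphabet as implemented by {{java.util.Base64}}. Which variant the patch actually accepts is not stated in this thread, and {{BinaryFieldCodec}} is a hypothetical name used only for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration only; not a class from the Hive code base.
final class BinaryFieldCodec {
    // Decodes the text of a JSON string value into the raw bytes of a binary column.
    // Throws IllegalArgumentException if the text is not valid base-64.
    static byte[] decode(String jsonText) {
        return Base64.getDecoder().decode(jsonText);
    }

    // Encodes raw bytes so they can be written out as a JSON string value.
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        String encoded = encode("hive".getBytes(StandardCharsets.UTF_8));
        System.out.println(encoded);
        System.out.println(new String(decode(encoded), StandardCharsets.UTF_8));
    }
}
```

JSON has no binary type, so binary columns have to travel as strings; base-64 keeps arbitrary bytes inside the printable ASCII range.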





[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write

2019-02-26 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778846#comment-16778846
 ] 

BELUGA BEHR commented on HIVE-21240:


All unit tests are passing [~bslim] [~kgyrtkirk].  Please consider this patch 
for inclusion into the project.  I understand there is some hesitation 
regarding the change in return type.  Previously a native array was returned and 
now a Collection (List) is returned by the SerDe.  I think it's better to work 
with Java Collections instead of native arrays and if we're going to change the 
return value at all, this is an appropriate time to introduce such a change, 
i.e., in a major (4.0) release.
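
A short sketch of what the change in return type means for callers. The class and method names are hypothetical and the row contents are made up; only the array-versus-List contrast comes from the comment above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration; not the actual SerDe API.
final class RowReturnTypes {
    // Old style: a native array. equals() compares identity, and any caller
    // can overwrite elements in place.
    static Object[] rowAsArray() {
        return new Object[] {1, "a", null};
    }

    // New style: a List. Supports value equality, contains(), and read-only views.
    static List<Object> rowAsList() {
        List<Object> row = new ArrayList<>(Arrays.<Object>asList(1, "a", null));
        return Collections.unmodifiableList(row);
    }

    public static void main(String[] args) {
        List<Object> row = rowAsList();
        System.out.println(row);                     // readable toString()
        System.out.println(row.equals(rowAsList())); // element-wise comparison
        Object[] legacy = row.toArray();             // adapter for array-based callers
        System.out.println(legacy.length);
    }
}
```

Callers that still expect an array can convert at the boundary with {{toArray()}}, so the migration cost is mostly confined to call sites.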

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.11.patch, 
> HIVE-21240.11.patch, HIVE-21240.11.patch, HIVE-21240.2.patch, 
> HIVE-21240.3.patch, HIVE-21240.4.patch, HIVE-21240.5.patch, 
> HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-21240.9.patch, 
> HIVE-24240.8.patch, kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row
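
The timestamp bullet in the description above can be illustrated with a sketch in which user-supplied patterns are actually applied before falling back to a default. The class name {{TimestampFormats}}, the ISO-8601 fallback, and the sample pattern are assumptions for the example, not a description of what the patch does.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not a class from the Hive code base.
final class TimestampFormats {
    private final List<DateTimeFormatter> formatters = new ArrayList<>();

    TimestampFormats(List<String> customPatterns) {
        // Custom patterns are tried first, in the order given.
        for (String pattern : customPatterns) {
            formatters.add(DateTimeFormatter.ofPattern(pattern));
        }
        // Assumed fallback: ISO-8601 local date-time, e.g. 2019-02-26T03:44:00.
        formatters.add(DateTimeFormatter.ISO_LOCAL_DATE_TIME);
    }

    // Returns the first successful parse; fails loudly if no format matches.
    LocalDateTime parse(String text) {
        for (DateTimeFormatter formatter : formatters) {
            try {
                return LocalDateTime.parse(text, formatter);
            } catch (DateTimeParseException e) {
                // Fall through and try the next format.
            }
        }
        throw new IllegalArgumentException("Unparseable timestamp: " + text);
    }

    public static void main(String[] args) {
        TimestampFormats formats = new TimestampFormats(Arrays.asList("yyyy/MM/dd HH:mm:ss"));
        System.out.println(formats.parse("2019/02/26 03:44:00"));
        System.out.println(formats.parse("2019-02-26T03:44:00"));
    }
}
```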





[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write

2019-02-26 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778846#comment-16778846
 ] 

BELUGA BEHR edited comment on HIVE-21240 at 2/27/19 3:44 AM:
-

All unit tests are passing [~bslim] [~kgyrtkirk].  Please consider this patch 
for inclusion into the project.  I understand there is some hesitation 
regarding the change in return type.  Previously a native array was returned and 
now (with this patch) a Collection (List) is returned by the SerDe.  I think 
it's better to work with Java Collections instead of native arrays and if we're 
going to change the return value, this is an appropriate time to introduce such 
a change, i.e., in a major (4.0) release.


was (Author: belugabehr):
All unit tests are passing [~bslim] [~kgyrtkirk].  Please consider this patch 
for inclusion into the project.  I understand there is some hesitation 
regarding the change in return type.  Previously a native array was returned and 
now (with this patch) a Collection (List) is returned by the SerDe.  I think 
it's better to work with Java Collections instead of native arrays and if we're 
going to change the return value at all, this is an appropriate time to 
introduce such a change, i.e., in a major (4.0) release.

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.11.patch, 
> HIVE-21240.11.patch, HIVE-21240.11.patch, HIVE-21240.2.patch, 
> HIVE-21240.3.patch, HIVE-21240.4.patch, HIVE-21240.5.patch, 
> HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-21240.9.patch, 
> HIVE-24240.8.patch, kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write

2019-02-26 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778846#comment-16778846
 ] 

BELUGA BEHR edited comment on HIVE-21240 at 2/27/19 3:44 AM:
-

All unit tests are passing [~bslim] [~kgyrtkirk].  Please consider this patch 
for inclusion into the project.  I understand there is some hesitation 
regarding the change in return type.  Previously a native array was returned and 
now (with this patch) a Collection (List) is returned by the SerDe.  I think 
it's better to work with Java Collections instead of native arrays and if we're 
going to change the return value at all, this is an appropriate time to 
introduce such a change, i.e., in a major (4.0) release.


was (Author: belugabehr):
All unit tests are passing [~bslim] [~kgyrtkirk].  Please consider this patch 
for inclusion into the project.  I understand there is some hesitation 
regarding the change in return type.  Previously a native array was returned and 
now a Collection (List) is returned by the SerDe.  I think it's better to work 
with Java Collections instead of native arrays and if we're going to change the 
return value at all, this is an appropriate time to introduce such a change, 
i.e., in a major (4.0) release.

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.11.patch, 
> HIVE-21240.11.patch, HIVE-21240.11.patch, HIVE-21240.2.patch, 
> HIVE-21240.3.patch, HIVE-21240.4.patch, HIVE-21240.5.patch, 
> HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-21240.9.patch, 
> HIVE-24240.8.patch, kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-26 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Attachment: HIVE-21240.11.patch

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.11.patch, 
> HIVE-21240.11.patch, HIVE-21240.11.patch, HIVE-21240.2.patch, 
> HIVE-21240.3.patch, HIVE-21240.4.patch, HIVE-21240.5.patch, 
> HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-21240.9.patch, 
> HIVE-24240.8.patch, kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-26 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Status: Patch Available  (was: Open)

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 3.1.1, 4.0.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.11.patch, 
> HIVE-21240.11.patch, HIVE-21240.11.patch, HIVE-21240.2.patch, 
> HIVE-21240.3.patch, HIVE-21240.4.patch, HIVE-21240.5.patch, 
> HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-21240.9.patch, 
> HIVE-24240.8.patch, kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-26 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Status: Open  (was: Patch Available)

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 3.1.1, 4.0.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.11.patch, 
> HIVE-21240.11.patch, HIVE-21240.11.patch, HIVE-21240.2.patch, 
> HIVE-21240.3.patch, HIVE-21240.4.patch, HIVE-21240.5.patch, 
> HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-21240.9.patch, 
> HIVE-24240.8.patch, kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write

2019-02-26 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778489#comment-16778489
 ] 

BELUGA BEHR commented on HIVE-21240:


[~bslim] Can you drop the test for {{kafka_table_2}} since it is no longer 
testing the 'basic implementation' as is described?

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.11.patch, 
> HIVE-21240.11.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, 
> HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, 
> HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch, 
> kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-26 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Status: Patch Available  (was: Open)

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 3.1.1, 4.0.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.11.patch, 
> HIVE-21240.11.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, 
> HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, 
> HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch, 
> kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-26 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Attachment: HIVE-21240.11.patch

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.11.patch, 
> HIVE-21240.11.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, 
> HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, 
> HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch, 
> kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-26 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Status: Open  (was: Patch Available)

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 3.1.1, 4.0.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.11.patch, 
> HIVE-21240.11.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, 
> HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, 
> HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch, 
> kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-26 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Status: Patch Available  (was: Open)

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 3.1.1, 4.0.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.11.patch, 
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, 
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, 
> HIVE-21240.9.patch, HIVE-24240.8.patch, kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-26 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Status: Open  (was: Patch Available)

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 3.1.1, 4.0.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.11.patch, 
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, 
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, 
> HIVE-21240.9.patch, HIVE-24240.8.patch, kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-26 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Attachment: HIVE-21240.11.patch

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.11.patch, 
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, 
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, 
> HIVE-21240.9.patch, HIVE-24240.8.patch, kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-25 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Attachment: HIVE-21240.11.patch

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.2.patch, 
> HIVE-21240.3.patch, HIVE-21240.4.patch, HIVE-21240.5.patch, 
> HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-21240.9.patch, 
> HIVE-24240.8.patch, kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-25 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Status: Patch Available  (was: Open)

Posting a patch with all the changes (serde and kafka) and we'll see what we 
get.

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 3.1.1, 4.0.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.2.patch, 
> HIVE-21240.3.patch, HIVE-21240.4.patch, HIVE-21240.5.patch, 
> HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-21240.9.patch, 
> HIVE-24240.8.patch, kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-25 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Status: Open  (was: Patch Available)

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 3.1.1, 4.0.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.2.patch, 
> HIVE-21240.3.patch, HIVE-21240.4.patch, HIVE-21240.5.patch, 
> HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-21240.9.patch, 
> HIVE-24240.8.patch, kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write

2019-02-25 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777516#comment-16777516
 ] 

BELUGA BEHR edited comment on HIVE-21240 at 2/26/19 2:54 AM:
-

[~bslim]

Thanks for the update.

Here is the diff I'm looking at: [^kafka_storage_handler.diff]

Passing the test with this diff requires the {{JsonSerDe}} from my local 
branch, which fixes the {{timestamp with local timezone}} handling.  As you can 
see, I have populated the values with the timestamp values.  Are you expecting 
all values to be lost (null)?

Regarding {{KafkaJsonSerDe}}, if you wish to keep it around, I recommend we 
move it to the 'test' directory so that it's not shipping with the actual 
product.  If it's not meant for production, we don't want to make it available, 
because there's always that one person that will find it and use it.  However, 
the Hive {{JsonSerDe}} is already the default in the Kafka project, so what is 
the level of effort (LOE) to use the one included with Hive rather than this 
test implementation?


was (Author: belugabehr):
[~bslim]

Thanks for the update.

Here is the diff I'm looking at: [^kafka_storage_handler.diff]

Passing the test with this diff requires the {{JsonSerDe}} from my local 
branch, which fixes the {{timestamp with local timezone}} handling.  As you can 
see, I have populated the values with the timestamp values.  Are you expecting 
all values to be lost?

Regarding {{KafkaJsonSerDe}}, if you wish to keep it around, I recommend we 
move it to the 'test' directory so that it's not shipping with the actual 
product.  If it's not meant for production, we don't want to make it available, 
because there's always that one person that will find it and use it.  However, 
the Hive {{JsonSerDe}} is already the default in the Kafka project, so what is 
the level of effort (LOE) to use the one included with Hive rather than this 
test implementation?

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, 
> HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, 
> HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch, 
> kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O\(n\) for 
> each row processed, for each column in the row
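
As a rough illustration of the last bullet in the description (a cache for column-name to column-index lookups), here is a minimal sketch. The class name {{ColumnIndexCache}} and its methods are hypothetical and are not taken from the patch; the case-insensitive matching is also an assumption made for the example. The idea is only that the map is built once per table, so each per-row, per-column lookup becomes a hash lookup instead of a linear scan.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not the class used in the patch.
final class ColumnIndexCache {
  private final Map<String, Integer> indexByName = new HashMap<>();

  ColumnIndexCache(List<String> columnNames) {
    for (int i = 0; i < columnNames.size(); i++) {
      // putIfAbsent keeps the first index when a name repeats,
      // which matches what a left-to-right linear scan would return.
      indexByName.putIfAbsent(normalize(columnNames.get(i)), i);
    }
  }

  /** Returns the column index, or -1 if the name is unknown. */
  int indexOf(String columnName) {
    return indexByName.getOrDefault(normalize(columnName), -1);
  }

  // Assumption: column names are matched case-insensitively.
  private static String normalize(String name) {
    return name.toLowerCase(Locale.ROOT);
  }
}
```

The map is built once in the constructor, and every later {{indexOf}} call is an expected constant-time lookup.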



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write

2019-02-25 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777516#comment-16777516
 ] 

BELUGA BEHR edited comment on HIVE-21240 at 2/26/19 2:53 AM:
-

[~bslim]

 

Thanks for the update.

Here is the diff I'm looking at: [^kafka_storage_handler.diff]

To pass the test with this diff, it requires that you use the {{JsonSerDe}} on 
my local branch which fixes the {{timestamp with local timezone}} stuff.  As 
you can see, I have populated the values with the timestamp values?  Are you 
expecting all values to be lost?

 

Regarding {{KafkaJsonSerDe}}, if you wish to keep it around, I recommend we 
move it to the 'test' directory so that it's not shipping with the actual 
product.  If it's not meant for production, we don't want to make it available, 
because there's always that one person that will find it and use it.  However, 
the Hive {{JsonSerde}} is already the default in the Kafka project, so what is 
the LOE to use the one included with Hive rather than this test implementation?


was (Author: belugabehr):
[~bslim]

 

Thanks for the update.

Here is the diff I'm looking at: [^kafka_storage_handler.diff]

To pass the test with this diff, it requires that you use the work for 
{{JsonSerDe}} on my local branch.  As you can see, I have populated the values 
with the timestamp values?  Are you expecting all values to be lost?

 

Regarding {{KafkaJsonSerDe}}, if you wish to keep it around, I recommend we 
move it to the 'test' directory so that it's not shipping with the actual 
product.  If it's not meant for production, we don't want to make it available, 
because there's always that one person that will find it and use it.  However, 
the Hive {{JsonSerde}} is already the default in the Kafka project, so what is 
the LOE to use the one included with Hive rather than this test implementation?

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, 
> HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, 
> HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch, 
> kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O\(n\) for 
> each row processed, for each column in the row





[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write

2019-02-25 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777516#comment-16777516
 ] 

BELUGA BEHR edited comment on HIVE-21240 at 2/26/19 2:53 AM:
-

[~bslim]

 

Thanks for the update.

Here is the diff I'm looking at: [^kafka_storage_handler.diff]

To pass the test with this diff, it requires that you use the {{JsonSerDe}} on 
my local branch which fixes the {{timestamp with local timezone}} stuff.  As 
you can see, I have populated the values with the timestamp values.  Are you 
expecting all values to be lost?

 

Regarding {{KafkaJsonSerDe}}, if you wish to keep it around, I recommend we 
move it to the 'test' directory so that it's not shipping with the actual 
product.  If it's not meant for production, we don't want to make it available, 
because there's always that one person that will find it and use it.  However, 
the Hive {{JsonSerde}} is already the default in the Kafka project, so what is 
the LOE to use the one included with Hive rather than this test implementation?


was (Author: belugabehr):
[~bslim]

 

Thanks for the update.

Here is the diff I'm looking at: [^kafka_storage_handler.diff]

To pass the test with this diff, it requires that you use the {{JsonSerDe}} on 
my local branch which fixes the {{timestamp with local timezone}} stuff.  As 
you can see, I have populated the values with the timestamp values?  Are you 
expecting all values to be lost?

 

Regarding {{KafkaJsonSerDe}}, if you wish to keep it around, I recommend we 
move it to the 'test' directory so that it's not shipping with the actual 
product.  If it's not meant for production, we don't want to make it available, 
because there's always that one person that will find it and use it.  However, 
the Hive {{JsonSerde}} is already the default in the Kafka project, so what is 
the LOE to use the one included with Hive rather than this test implementation?

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, 
> HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, 
> HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch, 
> kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O\(n\) for 
> each row processed, for each column in the row





[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write

2019-02-25 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777516#comment-16777516
 ] 

BELUGA BEHR commented on HIVE-21240:


[~bslim]

 

Thanks for the update.

Here is the diff I'm looking at: [^kafka_storage_handler.diff]

To pass the test with this diff, it requires that you use the work for 
{{JsonSerDe}} on my local branch.  As you can see, I have populated the values 
with the timestamp values?  Are you expecting all values to be lost?

 

Regarding {{KafkaJsonSerDe}}, if you wish to keep it around, I recommend we 
move it to the 'test' directory so that it's not shipping with the actual 
product.  If it's not meant for production, we don't want to make it available, 
because there's always that one person that will find it and use it.  However, 
the Hive {{JsonSerde}} is already the default in the Kafka project, so what is 
the LOE to use the one included with Hive rather than this test implementation?

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, 
> HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, 
> HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch, 
> kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O\(n\) for 
> each row processed, for each column in the row





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-25 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Attachment: kafka_storage_handler.diff

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, 
> HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, 
> HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch, 
> kafka_storage_handler.diff
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O\(n\) for 
> each row processed, for each column in the row





[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write

2019-02-25 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777446#comment-16777446
 ] 

BELUGA BEHR commented on HIVE-21240:


[~kgyrtkirk] OK. I finally figured it out.

cc: [~bslim]

There are a few things going on:

The Kafka Storage Handler q-test is incorrect. There are several tables that 
define a {{timestamp}} or a {{timestamp with local time zone}} column. And for 
every q-test result, the q-test expects a {{NULL}} value in the column. 
However, I do believe this is incorrect and therefore this q-test is testing 
for a wrong behavior. There is timestamp data in the test data set and there 
should be values present, not {{NULL}} values.

I believe that the current JSON SerDe is unable to handle the timestamp Strings 
in the test data set and instead of throwing an Exception, it swallows the bad 
value and returns a null value. Check it out 
[here|https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils.java#L1298].
 I do not think this is good behavior. Users may silently lose data and there 
is no way to configure it otherwise. However, even if it were configurable, the 
default should be to throw an Exception and stop processing, not to lose 
data.
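
To make the two behaviors concrete, here is a minimal sketch contrasting a parser that swallows the failure with one that surfaces it. This is not the Hive code linked above; the class and method names are hypothetical, and {{java.sql.Timestamp.valueOf}} is used only as a convenient stand-in parser that rejects anything other than the {{yyyy-[m]m-[d]d hh:mm:ss[.f...]}} layout.

```java
import java.sql.Timestamp;

// Hypothetical illustration; not the Hive implementation.
final class TimestampParsePolicy {

  /** Lenient: unparsable input becomes null, so bad data looks like missing data. */
  static Timestamp parseLenient(String text) {
    try {
      return Timestamp.valueOf(text);
    } catch (IllegalArgumentException e) {
      return null; // the failure is silently dropped here
    }
  }

  /** Strict: unparsable input stops processing and names the offending value. */
  static Timestamp parseStrict(String text) {
    try {
      return Timestamp.valueOf(text);
    } catch (IllegalArgumentException e) {
      throw new IllegalArgumentException("Cannot parse timestamp: '" + text + "'", e);
    }
  }
}
```

With the lenient variant a caller cannot distinguish a genuinely absent value from one that failed to parse, which is the data-loss concern raised above.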

Additionally, there is a q-test that states:
{quote}using basic implementation of flat json probably to be removed
{quote}
This is achieved by not specifying an explicit table SerDe in the q-test. The 
'basic implementation' is probably in reference to the class 
{{KafkaJsonSerDe}}. However, this class is not used anymore as far as I can tell 
and is no longer the default SerDe for the Kafka Storage Handler. The Hive 
{{JsonSerde}} is [the default 
serde|https://github.com/apache/hive/blob/master/kafka-handler/src/java/org/apache/hadoop/hive/kafka/KafkaTableProperties.java#L38].
 One thing you'll notice with this particular q-test is that there is also no 
{{timestamp.formats}} defined for the table. This is because {{KafkaJsonSerDe}} 
handles [ISO-8601 
format|https://github.com/apache/hive/blob/master/kafka-handler/src/java/org/apache/hadoop/hive/kafka/KafkaJsonSerDe.java#L230]
 (and only that format) so it does not need to explicitly specify the format.  
All of the data in this test are formatted as ISO-8601 and if you look at all 
the other tables in the test, they all pass in an ISO-8601 format string. It is 
required to pass this format string because the Hive {{JsonSerde}} does not 
handle that format by default. Without the {{timestamp.formats}} defined on 
this table, there is no way that the current {{JsonSerde}} is handling this 
data correctly.  It again demonstrates that the current {{JsonSerde}} behavior 
is to swallow the exception, and return NULL.
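
The format mismatch can be sketched with plain {{java.time}}. The sample values and the {{uuuu-MM-dd HH:mm:ss}} pattern below are assumptions chosen for the example (they are not read from the q-test or from Hive); the point is only that a parser expecting a space-separated layout rejects an ISO-8601 string unless it is told about that format.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical illustration; the "default" layout here is an assumed stand-in.
final class TimestampFormatDemo {
  static final DateTimeFormatter DEFAULT_LAYOUT =
      DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss");

  /** True if the text matches the assumed space-separated default layout. */
  static boolean parsesWithDefault(String text) {
    try {
      LocalDateTime.parse(text, DEFAULT_LAYOUT);
      return true;
    } catch (DateTimeParseException e) {
      return false;
    }
  }

  /** True if the text is an ISO-8601 instant such as 2018-08-20T20:37:05Z. */
  static boolean parsesAsIsoInstant(String text) {
    try {
      Instant.parse(text);
      return true;
    } catch (DateTimeParseException e) {
      return false;
    }
  }
}
```

Each parser accepts its own layout and rejects the other one, so a table without an explicit format string can only be read correctly if its data happens to match the default.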

[https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/kafka_storage_handler.q]
 
[https://github.com/apache/hive/blob/master/ql/src/test/results/clientpositive/kafka/kafka_storage_handler.q.out]

 

What triggered these q-test failures for this JIRA is threefold:
 # This JIRA's implementation was in some cases producing valid timestamps 
instead of NULL
 # This JIRA's implementation was throwing an exception because it was unable 
to parse the timestamp String without the format string defined
 # This implementation was in some cases throwing an exception because it did 
not support {{timestamp with local time zone}}.

 
 Moving forward, I would like to:
 # Add to this implementation the ability to process {{timestamp with local 
time zone}}. This is not fully supported in the current {{JsonSerde}} 
implementation; only 
[timestamp|https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/json/HiveJsonStructReader.java#L355]
 is supported.  It works in trunk because a conversion step that runs 
immediately before this _switch_ statement does the work.
 # Update the {{kafka_storage_handler}} q-test to check for the correct 
timestamp values
 # Remove {{KafkaJsonSerDe}} serde
 # Remove the "basic implementation of flat json" test from the q-test

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, 
> HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, 
> HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch
>
>  

[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write

2019-02-25 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777129#comment-16777129
 ] 

BELUGA BEHR commented on HIVE-21240:


[~kgyrtkirk] Figured out my local failure for this UT.  Will investigate 
this failure further.

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, 
> HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, 
> HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O\(n\) for 
> each row processed, for each column in the row





[jira] [Updated] (HIVE-21317) Unit Test kafka_storage_handler Is Failing Regularly

2019-02-25 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21317:
---
Priority: Minor  (was: Critical)

> Unit Test kafka_storage_handler Is Failing Regularly
> 
>
> Key: HIVE-21317
> URL: https://issues.apache.org/jira/browse/HIVE-21317
> Project: Hive
>  Issue Type: Task
>Affects Versions: 4.0.0
>Reporter: BELUGA BEHR
>Priority: Minor
>
> {code}
> org.apache.hadoop.hive.cli.TestMiniHiveKafkaCliDriver.testCliDriver[kafka_storage_handler]
>  (batchId=275)
> {code}





[jira] [Resolved] (HIVE-21317) Unit Test kafka_storage_handler Is Failing Regularly

2019-02-25 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR resolved HIVE-21317.

Resolution: Not A Problem

> Unit Test kafka_storage_handler Is Failing Regularly
> 
>
> Key: HIVE-21317
> URL: https://issues.apache.org/jira/browse/HIVE-21317
> Project: Hive
>  Issue Type: Task
>Affects Versions: 4.0.0
>Reporter: BELUGA BEHR
>Priority: Critical
>
> {code}
> org.apache.hadoop.hive.cli.TestMiniHiveKafkaCliDriver.testCliDriver[kafka_storage_handler]
>  (batchId=275)
> {code}





[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write

2019-02-25 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777102#comment-16777102
 ] 

BELUGA BEHR edited comment on HIVE-21240 at 2/25/19 5:24 PM:
-

[~kgyrtkirk] Thanks!

#1 I'm not sure I understand the first request.  Are you talking specifically 
about the HCat code?  Are there missing unit tests here?  Is that why it passes 
even though the data types have been changed?  As I see it the native arrays 
are all transformed into Java Collections:

{code:java|title=HCat JsonSerDe}
  List fatRow = fatLand((Object[]) row);
  return new DefaultHCatRecord(fatRow);
...
return Arrays.asList(ArrayUtils.toObject((int[]) arr));
{code}

So, the JSON SerDe should just create Java Collections from the get-go instead 
of having to transform it later.
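
A minimal sketch of that point, using only the JDK rather than the {{ArrayUtils}} call quoted above. The class and method names are hypothetical: the first method shows the extra boxing pass that a primitive array forces on the consumer, and the second shows the alternative of accumulating into a {{List}} while parsing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration; not code from HCat or from the patch.
final class CollectionsFromTheStart {

  /** Extra conversion needed when the reader hands back a primitive array. */
  static List<Integer> boxLater(int[] values) {
    return Arrays.stream(values).boxed().collect(Collectors.toList());
  }

  /** Alternative: build the List during parsing, so no second pass is needed. */
  static List<Integer> parseDirectly(String[] tokens) {
    List<Integer> out = new ArrayList<>(tokens.length);
    for (String token : tokens) {
      out.add(Integer.valueOf(token.trim()));
    }
    return out;
  }
}
```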

#2 I noted that the Kafka_Handler Q-Test fails locally on trunk as well.  I 
searched across JIRA and see this test fails across many places. I'm not 
suggesting that there be an exception to the "all green" policy, simply that I 
need help with investigating the cause as I believe it is outside the scope of 
this one JIRA.

#3 I don't think there's much value in going back and changing the code and 
testing it.  These proposed changes are not about making the SerDe faster, I 
just want to put out there that there isn't a huge regression.  If it's a bit 
quicker, then that's an added bonus.


was (Author: belugabehr):
[~kgyrtkirk] Thanks!

#1 I'm not sure I understand the first request.  Are you talking specifically 
about the HCat code?  Are there missing unit tests here?  Is that why it passes 
even though the data types have been changed?  As I see it the native arrays 
are all transformed into Java Collections:

{code:java|title=HCat JsonSerDe}
  List fatRow = fatLand((Object[]) row);
  return new DefaultHCatRecord(fatRow);
...
return Arrays.asList(ArrayUtils.toObject((int[]) arr));
{code}

So, the JSON SerDe should just create Java Collections from the get-go instead 
of having to transform it later.

#2 I noted that the Kafka_Handler Q-Test fails locally on trunk as well.  I 
searched across JIRA and see this test fails across many places. I can keep 
looking at it though.

#3 I don't think there's much value in going back and changing the code and 
testing it.  These proposed changes are not about making the SerDe faster, I 
just want to put out there that there isn't a huge regression.  If it's a bit 
quicker, then that's an added bonus.

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, 
> HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, 
> HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O\(n\) for 
> each row processed, for each column in the row





[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write

2019-02-25 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777102#comment-16777102
 ] 

BELUGA BEHR commented on HIVE-21240:


[~kgyrtkirk] Thanks!

#1 I'm not sure I understand the first request.  Are you talking specifically 
about the HCat code?  Are there missing unit tests here?  Is that why it passes 
even though the data types have been changed?  As I see it the native arrays 
are all transformed into Java Collections:

{code:java|title=HCat JsonSerDe}
  List fatRow = fatLand((Object[]) row);
  return new DefaultHCatRecord(fatRow);
...
return Arrays.asList(ArrayUtils.toObject((int[]) arr));
{code}

So, the JSON SerDe should just create Java Collections from the get-go instead 
of having to transform it later.

#2 I noted that the Kafka_Handler Q-Test fails locally on trunk as well.  I 
searched across JIRA and see this test fails across many places. I can keep 
looking at it though.

#3 I don't think there's much value in going back and changing the code and 
testing it.  These proposed changes are not about making the SerDe faster, I 
just want to put out there that there isn't a huge regression.  If it's a bit 
quicker, then that's an added bonus.

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, 
> HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, 
> HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O\(n\) for 
> each row processed, for each column in the row





[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write

2019-02-25 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777012#comment-16777012
 ] 

BELUGA BEHR commented on HIVE-21240:


[~kgyrtkirk] I created [HIVE-21317] to address the one failing unit test.

Can you please review the latest patch?

Thanks!

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, 
> HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, 
> HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O\(n\) for 
> each row processed, for each column in the row





[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2019-02-25 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776989#comment-16776989
 ] 

BELUGA BEHR commented on HIVE-20079:


[~asinkovits] Thanks for the clarification.  I see now that the current 
implementation just counts columns.  I'm on the same page now.  Thanks.

> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Assignee: Antal Sinkovits
>Priority: Major
> Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, 
> HIVE-20079.3.patch, HIVE-20079.4.patch, HIVE-20079.5.patch, HIVE-20079.6.patch
>
>
> Run the following queries and you will see that the rawDataSize for the table 
> is incorrectly reported as 4 (that is, the number of fields). We need to 
> populate the correct data size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles1
>   numRows 2
>   rawDataSize 4
>   totalSize   373
>   transient_lastDdlTime   1530660523
> {noformat}





[jira] [Updated] (HIVE-21303) Update TextRecordReader

2019-02-25 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21303:
---
Status: Patch Available  (was: Open)

Thanks [~pvary].  So, what I realized just now is that this class is only used 
for unit tests.  I propose moving this class out of the Hive main code base and 
into test.  Patch included.

> Update TextRecordReader
> ---
>
> Key: HIVE-21303
> URL: https://issues.apache.org/jira/browse/HIVE-21303
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Affects Versions: 4.0.0, 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
> Attachments: HIVE-21303.1.patch, HIVE-21303.2.patch
>
>
> Remove use of Deprecated 
> {{org.apache.hadoop.mapred.LineRecordReader.LineReader}}
> For every call to {{next}}, the code dives into the configuration map to see 
> if this feature is enabled.  Just look it up once and cache the value.
> {code:java}
> public int next(Writable row) throws IOException {
> ...
> if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVESCRIPTESCAPE)) {
>   return HiveUtils.unescapeText((Text) row);
> }
> return bytesConsumed;
> }
> {code}
> Other clean up.
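
The "look it up once and cache the value" idea from the description can be sketched as follows. This is a self-contained stand-in, not the patch: a plain {{Map}} plays the role of the configuration, and the key name, class name, and the unescape step are all hypothetical.

```java
import java.util.Map;

// Hypothetical illustration; a Map stands in for the real configuration object.
final class CachedFlagReader {
  // Assumed key name, used only for this sketch.
  static final String ESCAPE_KEY = "example.script.escape";

  // Resolved once at construction time and reused for every row.
  private final boolean escapeEnabled;

  CachedFlagReader(Map<String, String> conf) {
    this.escapeEnabled = Boolean.parseBoolean(conf.getOrDefault(ESCAPE_KEY, "false"));
  }

  /** Per-row work consults the cached field instead of the configuration map. */
  String next(String row) {
    // Simplified stand-in for unescaping: turn the two characters \t into a tab.
    return escapeEnabled ? row.replace("\\t", "\t") : row;
  }
}
```

One consequence of this design is that changing the configuration after the reader is constructed has no effect on that reader, which is usually acceptable for a per-task setting.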





[jira] [Updated] (HIVE-21303) Update TextRecordReader

2019-02-25 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21303:
---
Attachment: HIVE-21303.2.patch

> Update TextRecordReader
> ---
>
> Key: HIVE-21303
> URL: https://issues.apache.org/jira/browse/HIVE-21303
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Affects Versions: 4.0.0, 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
> Attachments: HIVE-21303.1.patch, HIVE-21303.2.patch
>
>
> Remove use of Deprecated 
> {{org.apache.hadoop.mapred.LineRecordReader.LineReader}}
> For every call to {{next}}, the code dives into the configuration map to see 
> if this feature is enabled.  Just look it up once and cache the value.
> {code:java}
> public int next(Writable row) throws IOException {
> ...
> if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVESCRIPTESCAPE)) {
>   return HiveUtils.unescapeText((Text) row);
> }
> return bytesConsumed;
> }
> {code}
> Other clean up.





[jira] [Updated] (HIVE-21303) Update TextRecordReader

2019-02-25 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21303:
---
Status: Open  (was: Patch Available)

> Update TextRecordReader
> ---
>
> Key: HIVE-21303
> URL: https://issues.apache.org/jira/browse/HIVE-21303
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Affects Versions: 4.0.0, 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
> Attachments: HIVE-21303.1.patch
>
>
> Remove use of Deprecated 
> {{org.apache.hadoop.mapred.LineRecordReader.LineReader}}
> For every call to {{next}}, the code dives into the configuration map to see 
> if this feature is enabled.  Just look it up once and cache the value.
> {code:java}
> public int next(Writable row) throws IOException {
> ...
> if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVESCRIPTESCAPE)) {
>   return HiveUtils.unescapeText((Text) row);
> }
> return bytesConsumed;
> }
> {code}
> Other clean up.





[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format

2019-02-25 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776925#comment-16776925
 ] 

BELUGA BEHR commented on HIVE-20079:


This patch is still incorrect.  It's actually producing the same wrong numbers 
as before, though, perhaps a bit more efficiently.

{code}
totalSize += block.getTotalByteSize();
{code}

{{getTotalByteSize()}} is not the same as "rawDataSize".


bq. rawDataSize—Approximate size of data in memory

https://www.cloudera.com/documentation/enterprise/5-15-x/topics/admin_hos_tuning.html

That means that for a single table row with 4 INTs (values: 1,2,3,4) I would 
expect a rawDataSize of (4 bytes x 4 Java ints) = 16 bytes.  However, Parquet 
would report this as 4 bytes because of the way that Parquet packs these 
numbers internally.  Hive should look at the row count and multiply it by the 
in-memory size of the data types in each row.

The {{AbstractSerDe}} class should have code to facilitate all of this, such as 
{{readNumber()}}, {{readString(int numBytes)}}, etc., that can be called as each 
row is read.
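
A rough sketch of the "row count times data type sizes" estimate, for fixed-width columns only. The enum ColumnType, its byte widths, and the method names are assumptions made for illustration; they are not Hive or Parquet APIs, and variable-length types such as strings would need their actual lengths.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: estimates in-memory data size from row count and column types.
final class RawDataSizeEstimate {
    // Hypothetical fixed-width column types with their in-memory size in bytes.
    enum ColumnType {
        INT(4), BIGINT(8), DOUBLE(8);

        final int bytes;

        ColumnType(int bytes) {
            this.bytes = bytes;
        }
    }

    // Sum of the column widths for a single row.
    static long rowWidth(List<ColumnType> schema) {
        long width = 0;
        for (ColumnType type : schema) {
            width += type.bytes;
        }
        return width;
    }

    // Row count multiplied by the per-row width; throws on overflow.
    static long estimate(long rowCount, List<ColumnType> schema) {
        return Math.multiplyExact(rowCount, rowWidth(schema));
    }

    public static void main(String[] args) {
        List<ColumnType> schema =
            Arrays.asList(ColumnType.INT, ColumnType.BIGINT, ColumnType.DOUBLE);
        // 3 rows x (4 + 8 + 8) bytes per row
        System.out.println(estimate(3, schema)); // prints 60
    }
}
```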

> Populate more accurate rawDataSize for parquet format
> -
>
> Key: HIVE-20079
> URL: https://issues.apache.org/jira/browse/HIVE-20079
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 2.0.0
>Reporter: Aihua Xu
>Assignee: Antal Sinkovits
>Priority: Major
> Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, 
> HIVE-20079.3.patch, HIVE-20079.4.patch, HIVE-20079.5.patch, HIVE-20079.6.patch
>
>
> Run the following queries and you will see that rawDataSize for the table is 
> incorrectly reported as 4 (that is, the number of fields). We need to populate 
> the correct data size so data can be split properly.
> {noformat}
> SET hive.stats.autogather=true;
> CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET;
> INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1');
> DESC FORMATTED parquet_stats;
> {noformat}
> {noformat}
> Table Parameters:
>   COLUMN_STATS_ACCURATE   true
>   numFiles                1
>   numRows                 2
>   rawDataSize             4
>   totalSize               373
>   transient_lastDdlTime   1530660523
> {noformat}





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-24 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Status: Patch Available  (was: Open)

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 3.1.1, 4.0.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, 
> HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, 
> HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row
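
The last bullet (caching column-name to column-index lookups) could look roughly like this. It is a minimal sketch, not the code from the patch: the class ColumnIndexCache and its methods are hypothetical, and names are matched exactly (no case folding).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: memoizes name-to-index lookups so the O(n) scan runs once per name.
final class ColumnIndexCache {
    private final List<String> columnNames;
    private final Map<String, Integer> indexByName = new HashMap<>();

    // Number of linear scans performed; kept only to illustrate the saving.
    int scans = 0;

    ColumnIndexCache(List<String> columnNames) {
        this.columnNames = columnNames;
    }

    // Linear search over the column list; returns -1 if the name is unknown.
    private int scan(String name) {
        scans++;
        return columnNames.indexOf(name);
    }

    // First call per name scans the list; later calls are a hash lookup.
    int indexOf(String name) {
        return indexByName.computeIfAbsent(name, this::scan);
    }

    public static void main(String[] args) {
        ColumnIndexCache cache = new ColumnIndexCache(Arrays.asList("a", "b", "c"));
        for (int row = 0; row < 1000; row++) {
            cache.indexOf("b");
        }
        System.out.println("linear scans: " + cache.scans);
    }
}
```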





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-24 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Attachment: HIVE-21240.10.patch

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, 
> HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, 
> HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-24 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Status: Open  (was: Patch Available)

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 3.1.1, 4.0.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, 
> HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, 
> HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Commented] (HIVE-15475) JsonSerDe cannot handle json file with empty lines

2019-02-24 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-15475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776298#comment-16776298
 ] 

BELUGA BEHR commented on HIVE-15475:


Nope. OK.  Figured it out.

This issue was inadvertently fixed as part of [HIVE-18545] (Jul 10, 2018).  
Prior to this change, JSON was handled by 
{{org.apache.hive.hcatalog.data.JsonSerDe}}

The issue was that this class was not handling the provided {{Text}} object 
correctly.  The {{Text}} object has two components: an internal array of 
bytes *and* a size that indicates how many of those bytes are to be processed.  
{{JsonSerDe}} was not taking the size into account, so when a zero-length 
{{Text}} object was submitted, it would still look at the entire internal byte 
array, ignoring the zero size, and produce duplicates where there should be no 
text.

https://github.com/apache/hive/blob/ae008b79b5d52ed6a38875b73025a505725828eb/hcatalog/core/src/main/java/org/apache/hive/hcatalog/data/JsonSerDe.java#L168
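
To illustrate the failure mode, here is a small self-contained sketch.  It does not use Hadoop: the nested class ReusableText is a hypothetical stand-in that mimics how {{Text}} exposes a reusable backing array through getBytes() together with a valid length through getLength().

```java
import java.nio.charset.StandardCharsets;

// Sketch only: shows why decoding a reused buffer must respect its valid length.
final class TextLengthDemo {
    // Hypothetical stand-in for a reusable text buffer: backing array plus valid length.
    static final class ReusableText {
        private byte[] bytes = new byte[0];
        private int length = 0;

        void set(String s) {
            byte[] src = s.getBytes(StandardCharsets.UTF_8);
            if (src.length > bytes.length) {
                bytes = new byte[src.length];
            }
            System.arraycopy(src, 0, bytes, 0, src.length);
            length = src.length; // bytes beyond 'length' keep their old contents
        }

        byte[] getBytes() {
            return bytes;
        }

        int getLength() {
            return length;
        }
    }

    // Buggy: decodes the whole backing array and ignores the valid length.
    static String decodeIgnoringLength(ReusableText t) {
        return new String(t.getBytes(), StandardCharsets.UTF_8);
    }

    // Correct: decodes only the first getLength() bytes.
    static String decodeWithLength(ReusableText t) {
        return new String(t.getBytes(), 0, t.getLength(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ReusableText text = new ReusableText();
        text.set("{\"a\":\"a_1\", \"b\" : 1}"); // a real record
        text.set("");                           // an empty line reuses the same buffer
        System.out.println("ignoring length: [" + decodeIgnoringLength(text) + "]");
        System.out.println("with length:     [" + decodeWithLength(text) + "]");
    }
}
```

With the buggy variant, the empty line decodes to the previous record again, which matches the duplicated rows in the report.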

> JsonSerDe cannot handle json file with empty lines
> --
>
> Key: HIVE-15475
> URL: https://issues.apache.org/jira/browse/HIVE-15475
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 1.2.1
>Reporter: pin_zhang
>Priority: Major
>
> 1. Start HiveServer2 in apache-hive-1.2.1.
> 2. Start Beeline, connect to HiveServer2, and run:
>   ADD JAR /home/apache-hive-1.2.1-bin/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.1.jar;
>   CREATE external TABLE my_table(a string, b bigint)
> ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
> STORED AS TEXTFILE
> location 'file:///home/hive/json';
> 3. Put a file with more than one newline at the end of the file:
> {"a":"a_1", "b" : 1}
> 4. Run SQL:
> select * from my_table ;
> +-------------+-------------+--+
> | my_table.a  | my_table.b  |
> +-------------+-------------+--+
> | a_1         | 1           |
> | a_1         | 1           |
> | a_1         | 1           |
> | a_1         | 1           |
> | a_1         | 1           |
> +-------------+-------------+--+





[jira] [Resolved] (HIVE-15475) JsonSerDe cannot handle json file with empty lines

2019-02-24 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-15475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR resolved HIVE-15475.

Resolution: Fixed

> JsonSerDe cannot handle json file with empty lines
> --
>
> Key: HIVE-15475
> URL: https://issues.apache.org/jira/browse/HIVE-15475
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 1.2.1
>Reporter: pin_zhang
>Priority: Major
>
> 1. Start HiveServer2 in apache-hive-1.2.1.
> 2. Start Beeline, connect to HiveServer2, and run:
>   ADD JAR /home/apache-hive-1.2.1-bin/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.1.jar;
>   CREATE external TABLE my_table(a string, b bigint)
> ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
> STORED AS TEXTFILE
> location 'file:///home/hive/json';
> 3. Put a file with more than one newline at the end of the file:
> {"a":"a_1", "b" : 1}
> 4. Run SQL:
> select * from my_table ;
> +-------------+-------------+--+
> | my_table.a  | my_table.b  |
> +-------------+-------------+--+
> | a_1         | 1           |
> | a_1         | 1           |
> | a_1         | 1           |
> | a_1         | 1           |
> | a_1         | 1           |
> +-------------+-------------+--+





[jira] [Comment Edited] (HIVE-15475) JsonSerDe cannot handle json file with empty lines

2019-02-23 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-15475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775965#comment-16775965
 ] 

BELUGA BEHR edited comment on HIVE-15475 at 2/23/19 6:16 PM:
-

I've been digging into this as part of HIVE-21240.

I'm pretty sure that this is related to [MAPREDUCE-6549], [MAPREDUCE-6481], 
[MAPREDUCE-6558] which have all been fixed in Hadoop 2.6.3/2.6.5

However, Hive 2.1 uses Hadoop 2.6.1:

https://github.com/apache/hive/blob/rel/release-2.1.1/pom.xml#L135

You have to use Hive 2.2.0 or higher:


https://github.com/apache/hive/blob/rel/release-2.2.0/pom.xml#L141



was (Author: belugabehr):
I've been digging into this as part of HIVE-21240.

I'm pretty sure that this is related to [MAPREDUCE-6549], [MAPREDUCE-6481], 
[MAPREDUCE-6558] which have all been fixed in Hadoop 2.6.3/2.6.5

However, Hive 2.1 uses Hadoop 2.6.1:

https://github.com/apache/hive/blob/rel/release-2.1.1/pom.xml#L135

You have to use Hive 2.2.1 or higher:


https://github.com/apache/hive/blob/rel/release-2.2.0/pom.xml#L141


> JsonSerDe cannot handle json file with empty lines
> --
>
> Key: HIVE-15475
> URL: https://issues.apache.org/jira/browse/HIVE-15475
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 1.2.1
>Reporter: pin_zhang
>Priority: Major
>
> 1. Start HiveServer2 in apache-hive-1.2.1.
> 2. Start Beeline, connect to HiveServer2, and run:
>   ADD JAR /home/apache-hive-1.2.1-bin/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.1.jar;
>   CREATE external TABLE my_table(a string, b bigint)
> ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
> STORED AS TEXTFILE
> location 'file:///home/hive/json';
> 3. Put a file with more than one newline at the end of the file:
> {"a":"a_1", "b" : 1}
> 4. Run SQL:
> select * from my_table ;
> +-------------+-------------+--+
> | my_table.a  | my_table.b  |
> +-------------+-------------+--+
> | a_1         | 1           |
> | a_1         | 1           |
> | a_1         | 1           |
> | a_1         | 1           |
> | a_1         | 1           |
> +-------------+-------------+--+





[jira] [Commented] (HIVE-15475) JsonSerDe cannot handle json file with empty lines

2019-02-23 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-15475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775965#comment-16775965
 ] 

BELUGA BEHR commented on HIVE-15475:


I've been digging into this as part of HIVE-21240.

I'm pretty sure that this is related to [MAPREDUCE-6549], [MAPREDUCE-6481], 
[MAPREDUCE-6558] which have all been fixed in Hadoop 2.6.3/2.6.5

However, Hive 2.1 uses Hadoop 2.6.1:

https://github.com/apache/hive/blob/rel/release-2.1.1/pom.xml#L135

You have to use Hive 2.2.1 or higher:


https://github.com/apache/hive/blob/rel/release-2.2.0/pom.xml#L141


> JsonSerDe cannot handle json file with empty lines
> --
>
> Key: HIVE-15475
> URL: https://issues.apache.org/jira/browse/HIVE-15475
> Project: Hive
>  Issue Type: Bug
>  Components: HCatalog
>Affects Versions: 1.2.1
>Reporter: pin_zhang
>Priority: Major
>
> 1. Start HiveServer2 in apache-hive-1.2.1.
> 2. Start Beeline, connect to HiveServer2, and run:
>   ADD JAR /home/apache-hive-1.2.1-bin/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.1.jar;
>   CREATE external TABLE my_table(a string, b bigint)
> ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
> STORED AS TEXTFILE
> location 'file:///home/hive/json';
> 3. Put a file with more than one newline at the end of the file:
> {"a":"a_1", "b" : 1}
> 4. Run SQL:
> select * from my_table ;
> +-------------+-------------+--+
> | my_table.a  | my_table.b  |
> +-------------+-------------+--+
> | a_1         | 1           |
> | a_1         | 1           |
> | a_1         | 1           |
> | a_1         | 1           |
> | a_1         | 1           |
> +-------------+-------------+--+





[jira] [Commented] (HIVE-21246) Un-bury DelimitedJSONSerDe from PlanUtils.java

2019-02-23 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775955#comment-16775955
 ] 

BELUGA BEHR commented on HIVE-21246:


[~ngangam] [~pvary] :)

> Un-bury DelimitedJSONSerDe from PlanUtils.java
> --
>
> Key: HIVE-21246
> URL: https://issues.apache.org/jira/browse/HIVE-21246
> Project: Hive
>  Issue Type: Improvement
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: HIVE-21246.1.patch, HIVE-21246.1.patch, 
> HIVE-21246.1.patch
>
>
> Ultimately, I'd like to get rid of 
> {{org.apache.hadoop.hive.serde2.DelimitedJSONSerDe}}, but for now, trying to 
> make it easier to get rid of later.  It's currently buried in 
> {{PlanUtils.java}}.
> A SerDe and a boolean flag get passed into these methods.  If the flag is 
> set to true, the specified SerDe is overwritten and assigned to 
> {{DelimitedJSONSerDe}}.  This is not documented anywhere and it's a weird 
> thing to do.  Just pass in the required SerDe from the start instead of 
> sending the wrong SerDe and a flag to overwrite it.
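
The shape of the proposed change can be sketched with SerDe class names as plain strings. The method names resolveWithFlag and resolve are hypothetical and are not the actual PlanUtils signatures; the sketch only contrasts the two calling conventions.

```java
// Sketch only: contrasts "SerDe plus override flag" with "pass the SerDe you want".
final class SerDeSelection {
    static final String DELIMITED_JSON =
        "org.apache.hadoop.hive.serde2.DelimitedJSONSerDe";
    static final String LAZY_SIMPLE =
        "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe";

    // Current style: the flag silently replaces whatever SerDe the caller asked for.
    static String resolveWithFlag(String requestedSerDe, boolean useDelimitedJson) {
        return useDelimitedJson ? DELIMITED_JSON : requestedSerDe;
    }

    // Proposed style: the caller decides up front and passes the SerDe it needs.
    static String resolve(String requestedSerDe) {
        return requestedSerDe;
    }

    public static void main(String[] args) {
        // The caller asked for LazySimpleSerDe but gets something else.
        System.out.println(resolveWithFlag(LAZY_SIMPLE, true));
        // The caller states what it wants; nothing is overwritten.
        System.out.println(resolve(DELIMITED_JSON));
    }
}
```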





[jira] [Updated] (HIVE-21303) Update TextRecordReader

2019-02-23 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21303:
---
Attachment: (was: HIVE-21303.1.patch)

> Update TextRecordReader
> ---
>
> Key: HIVE-21303
> URL: https://issues.apache.org/jira/browse/HIVE-21303
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Affects Versions: 4.0.0, 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
> Attachments: HIVE-21303.1.patch
>
>
> Remove use of Deprecated 
> {{org.apache.hadoop.mapred.LineRecordReader.LineReader}}
> For every call to {{next}}, the code dives into the configuration map to see 
> if this feature is enabled.  Just look it up once and cache the value.
> {code:java}
> public int next(Writable row) throws IOException {
> ...
> if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVESCRIPTESCAPE)) {
>   return HiveUtils.unescapeText((Text) row);
> }
> return bytesConsumed;
> }
> {code}
> Other clean up.





[jira] [Commented] (HIVE-21303) Update TextRecordReader

2019-02-23 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775944#comment-16775944
 ] 

BELUGA BEHR commented on HIVE-21303:


[~pvary] [~ngangam] :)

> Update TextRecordReader
> ---
>
> Key: HIVE-21303
> URL: https://issues.apache.org/jira/browse/HIVE-21303
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Affects Versions: 4.0.0, 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
> Attachments: HIVE-21303.1.patch
>
>
> Remove use of Deprecated 
> {{org.apache.hadoop.mapred.LineRecordReader.LineReader}}
> For every call to {{next}}, the code dives into the configuration map to see 
> if this feature is enabled.  Just look it up once and cache the value.
> {code:java}
> public int next(Writable row) throws IOException {
> ...
> if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVESCRIPTESCAPE)) {
>   return HiveUtils.unescapeText((Text) row);
> }
> return bytesConsumed;
> }
> {code}
> Other clean up.





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-22 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Attachment: HIVE-21240.9.patch

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, 
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, 
> HIVE-21240.9.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-22 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Status: Patch Available  (was: Open)

Added patch to fix JSON writer when using derived column names (_c0, _c1, etc.)

OK.  So, the Kafka_Handler Q-Test fails locally on trunk as well, so please 
ignore that UT failure.  If Jenkins comes back clean, please consider accepting 
[^HIVE-21240.9.patch] for inclusion into the project.

 

Reads with this SerDe are a bit quicker; writes are a bit slower.  I'm not exactly 
sure what makes the reads faster, but the slower writes are expected, as the 
writer more fully utilizes the Jackson library whereas the current 
implementation uses its own writing mechanism, which is very lightweight.

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 3.1.1, 4.0.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, 
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, 
> HIVE-21240.9.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-22 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Attachment: (was: HIVE-24240.8.patch)

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, 
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-22 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Attachment: (was: HIVE-24240.8.patch)

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, 
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-22 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Status: Open  (was: Patch Available)

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 3.1.1, 4.0.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, 
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, 
> HIVE-21240.8.patch, HIVE-21240.8.patch, HIVE-24240.8.patch, 
> HIVE-24240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-22 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Attachment: (was: HIVE-24240.8.patch)

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, 
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-22 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Attachment: (was: HIVE-21240.8.patch)

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, 
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-22 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Attachment: (was: HIVE-21240.8.patch)

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, 
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Issue Comment Deleted] (HIVE-21240) JSON SerDe Re-Write

2019-02-22 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Comment: was deleted

(was: Though, I am getting a failure in some scenarios that are not picked up 
in the UTs.  I need to investigate them further.)

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, 
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, 
> HIVE-21240.8.patch, HIVE-21240.8.patch, HIVE-24240.8.patch, 
> HIVE-24240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O\(n\) for 
> each row processed, for each column in the row
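
The base-64 bullet above concerns binary values, which JSON has to carry as text. A minimal sketch using the JDK's `java.util.Base64` is shown here. The class name `Base64Field` and its methods are hypothetical, and the choice of the basic (non-URL, non-MIME) alphabet is an assumption, since the description does not say which variant the patch uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper for binary columns stored as base-64 text in JSON. */
public class Base64Field {

  /** Decodes the JSON string value into the raw bytes of a binary column. */
  public static byte[] decode(String jsonText) {
    return Base64.getDecoder().decode(jsonText);
  }

  /** Encodes raw bytes into the text form written to JSON. */
  public static String encode(byte[] value) {
    return Base64.getEncoder().encodeToString(value);
  }

  /** Test convenience: encodes the UTF-8 bytes of a string. */
  public static String encodeUtf8(String text) {
    return encode(text.getBytes(StandardCharsets.UTF_8));
  }

  /** Test convenience: decodes and interprets the bytes as UTF-8. */
  public static String decodeUtf8(String jsonText) {
    return new String(decode(jsonText), StandardCharsets.UTF_8);
  }
}
```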





[jira] [Issue Comment Deleted] (HIVE-21240) JSON SerDe Re-Write

2019-02-22 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Comment: was deleted

(was: OK, I figured out the issue.  I am running this SerDe in CDH 6.1 (based 
on Hive 2.2) and it fails with a version-mismatch issue when handling dates.

 

This patch contains a JsonSerDe which is faster (read) and more feature rich 
than the existing JsonSerde.  Please accept the latest patch for inclusion into 
the project.)

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, 
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, 
> HIVE-21240.8.patch, HIVE-21240.8.patch, HIVE-24240.8.patch, 
> HIVE-24240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O\(n\) for 
> each row processed, for each column in the row





[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write

2019-02-22 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775543#comment-16775543
 ] 

BELUGA BEHR edited comment on HIVE-21240 at 2/22/19 7:41 PM:
-

OK, I figured out the issue.  I am running this SerDe in CDH 6.1 (based on Hive 
2.2) and it fails with a version-mismatch issue when handling dates.

 

This patch contains a JsonSerDe which is faster (read) and more feature rich 
than the existing JsonSerde.  Please accept the latest patch for inclusion into 
the project.


was (Author: belugabehr):
OK, I figured out the issue.  I am running this SerDe in CDH 6.1 and it fails 
with a version-mismatch issue when handling dates.

 

This patch contains a JsonSerDe which is faster (read) and more feature rich 
than the existing JsonSerde.  Please accept the latest patch for inclusion into 
the project.

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, 
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, 
> HIVE-21240.8.patch, HIVE-21240.8.patch, HIVE-24240.8.patch, 
> HIVE-24240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O\(n\) for 
> each row processed, for each column in the row





[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write

2019-02-22 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775543#comment-16775543
 ] 

BELUGA BEHR commented on HIVE-21240:


OK, I figured out the issue.  I am running this SerDe in CDH 6.1 and it fails 
with a version-mismatch issue when handling dates.

 

This patch contains a JsonSerDe which is faster (read) and more feature rich 
than the existing JsonSerde.  Please accept the latest patch for inclusion into 
the project.

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, 
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, 
> HIVE-21240.8.patch, HIVE-21240.8.patch, HIVE-24240.8.patch, 
> HIVE-24240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O\(n\) for 
> each row processed, for each column in the row





[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write

2019-02-22 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775343#comment-16775343
 ] 

BELUGA BEHR commented on HIVE-21240:


Though, I am getting a failure in some scenarios that are not picked up in the 
UTs.  I need to investigate them further.

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, 
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, 
> HIVE-21240.8.patch, HIVE-21240.8.patch, HIVE-24240.8.patch, 
> HIVE-24240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O\(n\) for 
> each row processed, for each column in the row





[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write

2019-02-22 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775331#comment-16775331
 ] 

BELUGA BEHR commented on HIVE-21240:


Read Performance

195 million JSON records (String, int, float, Date)

JSON-Trunk: 160s
JSON-21240: 147s

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, 
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, 
> HIVE-21240.8.patch, HIVE-21240.8.patch, HIVE-24240.8.patch, 
> HIVE-24240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O\(n\) for 
> each row processed, for each column in the row





[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write

2019-02-21 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774730#comment-16774730
 ] 

BELUGA BEHR commented on HIVE-21240:


OK.  My branch was a bit behind the trunk.  Kafka Handler is using JSON SerDe 
so I will need to look more closely to see if these unit test failures are 
related.

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, 
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, 
> HIVE-21240.8.patch, HIVE-21240.8.patch, HIVE-24240.8.patch, 
> HIVE-24240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O\(n\) for 
> each row processed, for each column in the row





[jira] [Updated] (HIVE-21303) Update TextRecordReader

2019-02-21 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21303:
---
Status: Open  (was: Patch Available)

> Update TextRecordReader
> ---
>
> Key: HIVE-21303
> URL: https://issues.apache.org/jira/browse/HIVE-21303
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Affects Versions: 4.0.0, 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
> Attachments: HIVE-21303.1.patch, HIVE-21303.1.patch
>
>
> Remove use of Deprecated 
> {{org.apache.hadoop.mapred.LineRecordReader.LineReader}}
> For every call to {{next}}, the code dives into the configuration map to see 
> if this feature is enabled.  Just look it up once and cache the value.
> {code:java}
> public int next(Writable row) throws IOException {
>   ...
>   if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVESCRIPTESCAPE)) {
>     return HiveUtils.unescapeText((Text) row);
>   }
>   return bytesConsumed;
> }
> {code}
> Other clean up.
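
The suggestion to look the flag up once and cache it can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: `CountingConf` is a hypothetical stand-in for the configuration object, `example.escape.enabled` is a placeholder key, and the unescape step is a toy replacement rather than `HiveUtils.unescapeText`.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical reader that resolves its escape flag once, at construction. */
public class CachedFlagReader {

  /** Stand-in for a configuration object; counts how often it is consulted. */
  public static class CountingConf {
    private final Map<String, String> values = new HashMap<>();
    public int lookups = 0;

    public void set(String key, String value) {
      values.put(key, value);
    }

    public boolean getBoolean(String key, boolean defaultValue) {
      lookups++;
      String v = values.get(key);
      return v == null ? defaultValue : Boolean.parseBoolean(v);
    }
  }

  /** Placeholder key, not a real Hive property name. */
  public static final String ESCAPE_KEY = "example.escape.enabled";

  private final boolean escape; // cached; never re-read per row

  public CachedFlagReader(CountingConf conf) {
    this.escape = conf.getBoolean(ESCAPE_KEY, false);
  }

  /** Per-row work uses the cached flag only. */
  public String next(String row) {
    // Toy unescape: turn a literal backslash-t into a tab character.
    return escape ? row.replace("\\t", "\t") : row;
  }

  /** Processes the given number of rows and reports how many config lookups occurred. */
  public static int lookupsAfterRows(int rows) {
    CountingConf conf = new CountingConf();
    conf.set(ESCAPE_KEY, "true");
    CachedFlagReader reader = new CachedFlagReader(conf);
    for (int i = 0; i < rows; i++) {
      reader.next("a\\tb");
    }
    return conf.lookups;
  }
}
```

However many rows are read, the configuration is consulted once, which is the behaviour the description asks for.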





[jira] [Updated] (HIVE-21303) Update TextRecordReader

2019-02-21 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21303:
---
Attachment: HIVE-21303.1.patch

> Update TextRecordReader
> ---
>
> Key: HIVE-21303
> URL: https://issues.apache.org/jira/browse/HIVE-21303
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Affects Versions: 4.0.0, 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
> Attachments: HIVE-21303.1.patch, HIVE-21303.1.patch
>
>
> Remove use of Deprecated 
> {{org.apache.hadoop.mapred.LineRecordReader.LineReader}}
> For every call to {{next}}, the code dives into the configuration map to see 
> if this feature is enabled.  Just look it up once and cache the value.
> {code:java}
> public int next(Writable row) throws IOException {
>   ...
>   if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVESCRIPTESCAPE)) {
>     return HiveUtils.unescapeText((Text) row);
>   }
>   return bytesConsumed;
> }
> {code}
> Other clean up.





[jira] [Updated] (HIVE-21303) Update TextRecordReader

2019-02-21 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21303:
---
Status: Patch Available  (was: Open)

> Update TextRecordReader
> ---
>
> Key: HIVE-21303
> URL: https://issues.apache.org/jira/browse/HIVE-21303
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Affects Versions: 4.0.0, 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
> Attachments: HIVE-21303.1.patch, HIVE-21303.1.patch
>
>
> Remove use of Deprecated 
> {{org.apache.hadoop.mapred.LineRecordReader.LineReader}}
> For every call to {{next}}, the code dives into the configuration map to see 
> if this feature is enabled.  Just look it up once and cache the value.
> {code:java}
> public int next(Writable row) throws IOException {
>   ...
>   if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVESCRIPTESCAPE)) {
>     return HiveUtils.unescapeText((Text) row);
>   }
>   return bytesConsumed;
> }
> {code}
> Other clean up.





[jira] [Updated] (HIVE-21303) Update TextRecordReader

2019-02-21 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21303:
---
Attachment: HIVE-21303.1.patch

> Update TextRecordReader
> ---
>
> Key: HIVE-21303
> URL: https://issues.apache.org/jira/browse/HIVE-21303
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Affects Versions: 4.0.0, 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
> Attachments: HIVE-21303.1.patch
>
>
> Remove use of Deprecated 
> {{org.apache.hadoop.mapred.LineRecordReader.LineReader}}
> For every call to {{next}}, the code dives into the configuration map to see 
> if this feature is enabled.  Just look it up once and cache the value.
> {code:java}
> public int next(Writable row) throws IOException {
>   ...
>   if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVESCRIPTESCAPE)) {
>     return HiveUtils.unescapeText((Text) row);
>   }
>   return bytesConsumed;
> }
> {code}
> Other clean up.





[jira] [Assigned] (HIVE-21303) Update TextRecordReader

2019-02-21 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR reassigned HIVE-21303:
--


> Update TextRecordReader
> ---
>
> Key: HIVE-21303
> URL: https://issues.apache.org/jira/browse/HIVE-21303
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Affects Versions: 4.0.0, 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
> Attachments: HIVE-21303.1.patch
>
>
> Remove use of Deprecated 
> {{org.apache.hadoop.mapred.LineRecordReader.LineReader}}
> For every call to {{next}}, the code dives into the configuration map to see 
> if this feature is enabled.  Just look it up once and cache the value.
> {code:java}
> public int next(Writable row) throws IOException {
>   ...
>   if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVESCRIPTESCAPE)) {
>     return HiveUtils.unescapeText((Text) row);
>   }
>   return bytesConsumed;
> }
> {code}
> Other clean up.





[jira] [Updated] (HIVE-21303) Update TextRecordReader

2019-02-21 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21303:
---
Status: Patch Available  (was: Open)

> Update TextRecordReader
> ---
>
> Key: HIVE-21303
> URL: https://issues.apache.org/jira/browse/HIVE-21303
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Affects Versions: 4.0.0, 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
> Attachments: HIVE-21303.1.patch
>
>
> Remove use of Deprecated 
> {{org.apache.hadoop.mapred.LineRecordReader.LineReader}}
> For every call to {{next}}, the code dives into the configuration map to see 
> if this feature is enabled.  Just look it up once and cache the value.
> {code:java}
> public int next(Writable row) throws IOException {
>   ...
>   if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVESCRIPTESCAPE)) {
>     return HiveUtils.unescapeText((Text) row);
>   }
>   return bytesConsumed;
> }
> {code}
> Other clean up.





[jira] [Commented] (HIVE-21275) Lower Logging Level in Operator Class

2019-02-21 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774110#comment-16774110
 ] 

BELUGA BEHR commented on HIVE-21275:


[~ngangam] [~pvary] Please review :)

> Lower Logging Level in Operator Class
> -
>
> Key: HIVE-21275
> URL: https://issues.apache.org/jira/browse/HIVE-21275
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 4.0.0, 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Fix For: 4.0.0, 3.2.0
>
> Attachments: HIVE-21275.1.patch
>
>
> There is an incredible amount of logging generated by the {{Operator}} during 
> the Q-Tests.
> I counted more than 1 *million* lines of pretty useless logging.  Please 
> lower to TRACE level.
> {code}
> 2019-02-14T14:25:31,612 DEBUG [pool-69-thread-1] exec.JoinOperator: Starting group
> 2019-02-14T14:25:31,612 DEBUG [pool-69-thread-1] exec.FileSinkOperator: Starting group
> 2019-02-14T14:25:31,612 DEBUG [pool-69-thread-1] exec.JoinOperator: Starting group
> 2019-02-14T14:25:31,612 DEBUG [pool-69-thread-1] exec.FileSinkOperator: Starting group
> 2019-02-14T14:25:31,612 DEBUG [pool-69-thread-1] exec.JoinOperator: Starting group
> 2019-02-14T14:25:31,612 DEBUG [pool-69-thread-1] exec.FileSinkOperator: Starting group
> 2019-02-14T14:25:31,612 DEBUG [pool-69-thread-1] exec.JoinOperator: Starting group
> 2019-02-14T14:25:31,612 DEBUG [pool-69-thread-1] exec.FileSinkOperator: Starting group
> {code}





[jira] [Commented] (HIVE-21264) Improvements Around CharTypeInfo

2019-02-21 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774111#comment-16774111
 ] 

BELUGA BEHR commented on HIVE-21264:


[~ngangam] [~pvary] Please review :)

> Improvements Around CharTypeInfo
> 
>
> Key: HIVE-21264
> URL: https://issues.apache.org/jira/browse/HIVE-21264
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 4.0.0, 3.2.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Minor
> Attachments: HIVE-21264.1.patch, HIVE-21264.2.patch
>
>
> The {{CharTypeInfo}} stores the type name of the data type (char/varchar) and 
> the length (1-255).  {{CharTypeInfo}} objects are often getting cached once 
> they are created.
> The {{hashcode()}} and {{equals()}} of its sub-classes varchar and char are 
> inconsistent.
> * Make hashcode and equals consistent (and fast)
> * Simplify the {{getQualifiedName}} implementation and reduce the scope to 
> protected
> * Other related nits
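
The first bullet above asks for `hashCode` and `equals` that agree with each other. A minimal sketch of that contract is shown here. `CharTypeSketch` is a hypothetical single class standing in for the real char/varchar subclasses, and its `getQualifiedName` is public only so the example can be exercised (the description proposes protected scope).

```java
import java.util.Objects;

/** Hypothetical stand-in for a char/varchar type descriptor. */
public final class CharTypeSketch {
  private final String typeName; // "char" or "varchar"
  private final int length;      // declared length of the type

  public CharTypeSketch(String typeName, int length) {
    this.typeName = Objects.requireNonNull(typeName, "typeName");
    this.length = length;
  }

  /** Equality uses exactly the fields that hashCode uses. */
  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof CharTypeSketch)) {
      return false;
    }
    CharTypeSketch that = (CharTypeSketch) other;
    return length == that.length && typeName.equals(that.typeName);
  }

  /** Cheap hash over the same two fields, so equal objects hash alike. */
  @Override
  public int hashCode() {
    return 31 * typeName.hashCode() + length;
  }

  /** Builds the qualified name, for example "varchar(10)". */
  public String getQualifiedName() {
    return typeName + "(" + length + ")";
  }
}
```

Because these objects are cached after creation, a hash that disagrees with `equals` would let equal types land in different buckets and miss the cache.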





[jira] [Updated] (HIVE-21289) Expect EQ and LIKE to Generate the Identical Explain Plans

2019-02-19 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21289:
---
Description: 
I generated some test data with the UUID function.

{code:sql}
explain select * from test_like where a like 
'abce6254-d437-426b-8873-2cbc153ddfbc';
explain select * from test_like where a = 
'abce6254-d437-426b-8873-2cbc153ddfbc';
{code}

{code}
Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: test_like
            filterExpr: (a like 'abce6254-d437-426b-8873-2cbc153ddfbc') (type: boolean)
            Statistics: Num rows: 262144 Data size: 9437184 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (a like 'abce6254-d437-426b-8873-2cbc153ddfbc') (type: boolean)
              Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: a (type: string)
                outputColumnNames: _col0
                Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code}

{code}
Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: test_like
            filterExpr: (a = 'abce6254-d437-426b-8873-2cbc153ddfbc') (type: boolean)
            Statistics: Num rows: 262144 Data size: 9437184 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (a = 'abce6254-d437-426b-8873-2cbc153ddfbc') (type: boolean)
              Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: 'abce6254-d437-426b-8873-2cbc153ddfbc' (type: string)
                outputColumnNames: _col0
                Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code}

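The two predicates may be evaluated the same way under the covers, but I would 
expect the EXPLAIN plans to be exactly the same.  Here they differ in the 
Select Operator: the LIKE plan projects column {{a}}, while the = plan 
projects the constant string.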
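
One way a planner could recognise that a LIKE predicate is really an equality test is to check that the pattern contains no unescaped wildcard. The sketch below illustrates only that check. `LikePatternSketch` and `isLiteral` are hypothetical names, the use of backslash as the escape character is an assumption, and nothing here describes how Hive's optimizer is actually implemented.

```java
/** Hypothetical check for LIKE patterns that contain no wildcards. */
public class LikePatternSketch {

  /**
   * Returns true if the pattern has no unescaped '%' or '_', in which case
   * "col LIKE pattern" matches the same rows as an equality comparison
   * against the unescaped pattern.
   */
  public static boolean isLiteral(String pattern) {
    for (int i = 0; i < pattern.length(); i++) {
      char c = pattern.charAt(i);
      if (c == '\\' && i + 1 < pattern.length()) {
        i++; // skip the escaped character, whatever it is
      } else if (c == '%' || c == '_') {
        return false; // a live wildcard: not a plain equality
      }
    }
    return true;
  }
}
```

The UUID literal used in the queries above contains neither wildcard, so under this check the LIKE query would qualify for the same treatment as the = query.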

  was:
I generated some test data with the UUID function.

{code:sql}
explain select * from test_like where a like 
'abce6254-d437-426b-8873-2cbc153ddfbc';
explain select * from test_like where a = 
'abce6254-d437-426b-8873-2cbc153ddfbc';
{code}

{code|title=LIKE}
Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: test_like
            filterExpr: (a like 'abce6254-d437-426b-8873-2cbc153ddfbc') (type: boolean)
            Statistics: Num rows: 262144 Data size: 9437184 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (a like 'abce6254-d437-426b-8873-2cbc153ddfbc') (type: boolean)
              Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: a (type: string)
                outputColumnNames: _col0
                Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: 

[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write

2019-02-18 Thread BELUGA BEHR (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771441#comment-16771441
 ] 

BELUGA BEHR commented on HIVE-21240:


I do not believe this failed unit test is related.  Please consider the latest 
patch for inclusion into the project. [^HIVE-24240.8.patch] 

{code:java}
2019-02-18T14:55:57,783 DEBUG [pool-17-thread-1] clients.NetworkClient: 
[Consumer clientId=958935173, groupId=] Initiating connection to node 
localhost:9093 (id: -1 rack: null)
2019-02-18T14:55:57,785 DEBUG [pool-17-thread-1] network.Selector: [Consumer 
clientId=958935173, groupId=] Connection with localhost/127.0.0.1 disconnected
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) 
~[?:1.8.0_191]
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) 
~[?:1.8.0_191]
at 
org.apache.kafka.common.network.PlaintextTransportLayer.finishConnect(PlaintextTransportLayer.java:50)
 ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.kafka.common.network.KafkaChannel.finishConnect(KafkaChannel.java:152)
 ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:471) 
~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at org.apache.kafka.common.network.Selector.poll(Selector.java:425) 
~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510) 
~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:271)
 ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:242)
 ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:218)
 ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.kafka.clients.consumer.internals.Fetcher.getTopicMetadata(Fetcher.java:274)
 ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1774)
 ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1742)
 ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.hadoop.hive.kafka.KafkaInputFormat.fetchTopicPartitions(KafkaInputFormat.java:189)
 ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.hadoop.hive.kafka.KafkaInputFormat.lambda$buildFullScanFromKafka$0(KafkaInputFormat.java:96)
 ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at org.apache.hadoop.hive.kafka.RetryUtils.retry(RetryUtils.java:93) 
~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at org.apache.hadoop.hive.kafka.RetryUtils.retry(RetryUtils.java:116) 
~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at org.apache.hadoop.hive.kafka.RetryUtils.retry(RetryUtils.java:109) 
~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.hadoop.hive.kafka.KafkaInputFormat.buildFullScanFromKafka(KafkaInputFormat.java:98)
 ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at 
org.apache.hadoop.hive.kafka.KafkaInputFormat.lambda$computeSplits$5(KafkaInputFormat.java:135)
 ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
[?:1.8.0_191]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[?:1.8.0_191]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[?:1.8.0_191]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
2019-02-18T14:55:57,787 DEBUG [pool-17-thread-1] clients.NetworkClient: 
[Consumer clientId=958935173, groupId=] Node -1 disconnected.
2019-02-18T14:55:57,787  WARN [pool-17-thread-1] clients.NetworkClient: 
[Consumer clientId=958935173, groupId=] Connection to node -1 could not be 
established. Broker may not be available.
2019-02-18T14:55:57,787 DEBUG [pool-17-thread-1] 
internals.ConsumerNetworkClient: [Consumer clientId=958935173, groupId=] 
Cancelled request with header RequestHeader(apiKey=METADATA, apiVersion=6, 
clientId=958935173, correlationId=32) due to node -1 being disconnected
2019-02-18T14:55:57,888 DEBUG [pool-17-thread-1] clients.NetworkClient: 
[Consumer clientId=958935173, groupId=] Give up sending metadata request since 
no node is available
2019-02-18T14:55:57,990 DEBUG [pool-17-thread-1] clients.NetworkClient: 
[Consumer clientId=958935173, groupId=] Give up sending metadata request since 
no node is available
2019-02-18T14:55:58,056 DEBUG 

[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-18 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Status: Patch Available  (was: Open)

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 3.1.1, 4.0.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, 
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, 
> HIVE-21240.8.patch, HIVE-21240.8.patch, HIVE-24240.8.patch, 
> HIVE-24240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O\(n\) for 
> each row processed, for each column in the row





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-18 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Status: Open  (was: Patch Available)

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 3.1.1, 4.0.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, 
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, 
> HIVE-21240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch, 
> HIVE-24240.8.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues; I will link them to this JIRA.
> * Use Jackson Tree parser instead of parsing manually
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support for skipping blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row
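
Two of the bullets above, base-64 encoded data and blank lines returned as all-null rows, can be illustrated with the JDK alone. This is a rough sketch under stated assumptions: the class JsonRowSketch and its method names are hypothetical and do not come from the patch, and the row is modelled as a plain Object[] rather than any Hive type.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical illustration only; not code from the HIVE-21240 patch. */
final class JsonRowSketch {

    /** A line counts as blank if it is empty or contains only whitespace. */
    static boolean isBlankLine(String line) {
        return line.trim().isEmpty();
    }

    /** A blank input line yields a row in which every column is null. */
    static Object[] nullRow(int columnCount) {
        return new Object[columnCount]; // Java initialises every slot to null
    }

    /** Binary values travel inside JSON strings as base-64 text. */
    static byte[] decodeBinary(String jsonStringValue) {
        return Base64.getDecoder().decode(jsonStringValue);
    }

    public static void main(String[] args) {
        System.out.println(isBlankLine("   "));  // prints true
        System.out.println(nullRow(3).length);   // prints 3
        byte[] raw = decodeBinary("aGl2ZQ==");
        System.out.println(new String(raw, StandardCharsets.UTF_8)); // prints hive
    }
}
```

Base64.getDecoder() uses the basic alphabet and throws IllegalArgumentException on malformed input, so a caller would need to decide how to surface that error.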





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-18 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Attachment: HIVE-21240.8.patch

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, 
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, 
> HIVE-21240.8.patch, HIVE-21240.8.patch, HIVE-24240.8.patch, 
> HIVE-24240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues; I will link them to this JIRA.
> * Use Jackson Tree parser instead of parsing manually
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support for skipping blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-16 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Status: Patch Available  (was: Open)

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 3.1.1, 4.0.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, 
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, 
> HIVE-21240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch, 
> HIVE-24240.8.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues; I will link them to this JIRA.
> * Use Jackson Tree parser instead of parsing manually
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support for skipping blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-16 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Attachment: HIVE-21240.8.patch

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 4.0.0, 3.1.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, 
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, 
> HIVE-21240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch, 
> HIVE-24240.8.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues; I will link them to this JIRA.
> * Use Jackson Tree parser instead of parsing manually
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support for skipping blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row





[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write

2019-02-16 Thread BELUGA BEHR (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BELUGA BEHR updated HIVE-21240:
---
Status: Open  (was: Patch Available)

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
>  Issue Type: Improvement
>  Components: Serializers/Deserializers
>Affects Versions: 3.1.1, 4.0.0
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, 
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, 
> HIVE-24240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues; I will link them to this JIRA.
> * Use Jackson Tree parser instead of parsing manually
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support for skipping blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row




