[jira] [Updated] (HIVE-19844) Make CSV SerDe First-Class SerDe
[ https://issues.apache.org/jira/browse/HIVE-19844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-19844: --- Description: According to the [Hive SerDe Docs|https://cwiki.apache.org/confluence/display/Hive/CSV+Serde], there are some extra steps involved in getting the CSV SerDe working with Hive. {code} CREATE TABLE my_table(a string, b string, ...) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( "separatorChar" = "\t", "quoteChar" = "'", "escapeChar"= "\\" ) STORED AS TEXTFILE; {code} I would like to propose that we move this SerDe into first-class status: {{STORED AS CSVFILE}} {{STORED AS TSVFILE}} The user should have to perform no additional steps to use this SerDe. was: According to the [Hive SerDe Docs|https://cwiki.apache.org/confluence/display/Hive/CSV+Serde], there are some extra steps involved in getting the CSV SerDe working with Hive. {code} CREATE TABLE my_table(a string, b string, ...) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( "separatorChar" = "\t", "quoteChar" = "'", "escapeChar"= "\\" ) STORED AS TEXTFILE; {code} I would like to propose that we move this SerDe into first-class status: {{STORED AS TEXT_CSV}} {{STORED AS TEXT_TSV}} The user should have to perform no additional steps to use this SerDe. > Make CSV SerDe First-Class SerDe > > > Key: HIVE-19844 > URL: https://issues.apache.org/jira/browse/HIVE-19844 > Project: Hive > Issue Type: Improvement > Components: HiveServer2, Serializers/Deserializers >Affects Versions: 3.0.0, 2.3.2, 4.0.0 >Reporter: BELUGA BEHR >Priority: Major > > According to the [Hive SerDe > Docs|https://cwiki.apache.org/confluence/display/Hive/CSV+Serde], there are > some extra steps involved in getting the CSV SerDe working with Hive. > {code} > CREATE TABLE my_table(a string, b string, ...) 
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' > WITH SERDEPROPERTIES ( >"separatorChar" = "\t", >"quoteChar" = "'", >"escapeChar"= "\\" > ) > STORED AS TEXTFILE; > {code} > I would like to propose that we move this SerDe into first-class status: > {{STORED AS CSVFILE}} > {{STORED AS TSVFILE}} > The user should have to perform no additional steps to use this SerDe. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (HIVE-19844) Make CSV SerDe First-Class SerDe
[ https://issues.apache.org/jira/browse/HIVE-19844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR reassigned HIVE-19844: -- Assignee: (was: anand) > Make CSV SerDe First-Class SerDe > > > Key: HIVE-19844 > URL: https://issues.apache.org/jira/browse/HIVE-19844 > Project: Hive > Issue Type: Improvement > Components: HiveServer2, Serializers/Deserializers >Affects Versions: 3.0.0, 2.3.2, 4.0.0 >Reporter: BELUGA BEHR >Priority: Major > > According to the [Hive SerDe > Docs|https://cwiki.apache.org/confluence/display/Hive/CSV+Serde], there are > some extra steps involved in getting the CSV SerDe working with Hive. > {code} > CREATE TABLE my_table(a string, b string, ...) > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' > WITH SERDEPROPERTIES ( >"separatorChar" = "\t", >"quoteChar" = "'", >"escapeChar"= "\\" > ) > STORED AS TEXTFILE; > {code} > I would like to propose that we move this SerDe into first-class status: > {{STORED AS TEXT_CSV}} > {{STORED AS TEXT_TSV}} > The user should have to perform no additional steps to use this SerDe. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21372) Use Apache Commons IO To Read Stream To String
[ https://issues.apache.org/jira/browse/HIVE-21372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21372: --- Attachment: HIVE-21372.1.patch > Use Apache Commons IO To Read Stream To String > -- > > Key: HIVE-21372 > URL: https://issues.apache.org/jira/browse/HIVE-21372 > Project: Hive > Issue Type: Improvement >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Fix For: 4.0.0 > > Attachments: HIVE-21372.1.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (HIVE-21372) Use Apache Commons IO To Read Stream To String
[ https://issues.apache.org/jira/browse/HIVE-21372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR reassigned HIVE-21372: -- Assignee: BELUGA BEHR > Use Apache Commons IO To Read Stream To String > -- > > Key: HIVE-21372 > URL: https://issues.apache.org/jira/browse/HIVE-21372 > Project: Hive > Issue Type: Improvement >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Fix For: 4.0.0 > > Attachments: HIVE-21372.1.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21372) Use Apache Commons IO To Read Stream To String
[ https://issues.apache.org/jira/browse/HIVE-21372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21372: --- Status: Patch Available (was: Open) > Use Apache Commons IO To Read Stream To String > -- > > Key: HIVE-21372 > URL: https://issues.apache.org/jira/browse/HIVE-21372 > Project: Hive > Issue Type: Improvement >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Trivial > Fix For: 4.0.0 > > Attachments: HIVE-21372.1.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (HIVE-21371) Make NonSyncByteArrayOutputStream Overflow Conscious
[ https://issues.apache.org/jira/browse/HIVE-21371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR reassigned HIVE-21371: -- Assignee: BELUGA BEHR > Make NonSyncByteArrayOutputStream Overflow Conscious > - > > Key: HIVE-21371 > URL: https://issues.apache.org/jira/browse/HIVE-21371 > Project: Hive > Issue Type: Improvement >Affects Versions: 4.0.0, 3.2.0 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Attachments: HIVE-21371.1.patch > > > {code:java|title=NonSyncByteArrayOutputStream} > private int enLargeBuffer(int increment) { > int temp = count + increment; > int newLen = temp; > if (temp > buf.length) { > if ((buf.length << 1) > temp) { > newLen = buf.length << 1; > } > byte newbuf[] = new byte[newLen]; > System.arraycopy(buf, 0, newbuf, 0, count); > buf = newbuf; > } > return newLen; > } > {code} > This will fail as the array approaches the 2GB limit on Java arrays: once > {{buf.length}} reaches 1GB, {{buf.length << 1}} overflows {{int}}, and > {{count + increment}} can overflow as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
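For readers following HIVE-21371, here is a minimal sketch of what an overflow-conscious growth computation could look like. It is not the attached patch: the class name {{OverflowConsciousBuffer}}, the method {{newCapacity}}, and the {{MAX_ARRAY_SIZE}} headroom of 8 are assumptions made for illustration. The idea is simply to do the arithmetic in {{long}} so that neither the sum nor the doubling can wrap around.

```java
/** Hypothetical helper for illustration only; not the code in HIVE-21371.1.patch. */
final class OverflowConsciousBuffer {

  // Assumption: leave a little headroom below Integer.MAX_VALUE, since some
  // JVMs cannot allocate arrays of exactly Integer.MAX_VALUE elements.
  static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

  /**
   * Returns a capacity that can hold count + increment bytes, doubling the
   * current length when that is enough and never exceeding MAX_ARRAY_SIZE.
   */
  static int newCapacity(int currentLength, int count, int increment) {
    long required = (long) count + increment; // 64-bit sum cannot wrap
    if (required > MAX_ARRAY_SIZE) {
      throw new OutOfMemoryError("Required buffer size too large: " + required);
    }
    if (required <= currentLength) {
      return currentLength; // already large enough, no growth needed
    }
    long doubled = (long) currentLength << 1; // 64-bit shift cannot wrap
    long target = Math.max(required, doubled);
    return (int) Math.min(target, MAX_ARRAY_SIZE); // clamp to the array limit
  }

  public static void main(String[] args) {
    System.out.println(newCapacity(16, 10, 4));           // 16
    System.out.println(newCapacity(16, 16, 1));           // 32
    System.out.println(newCapacity(1 << 30, 1 << 30, 1)); // 2147483639
  }
}
```

Unlike {{enLargeBuffer}}, this sketch returns the capacity to allocate and leaves the copy to the caller, which keeps the arithmetic easy to test in isolation.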
[jira] [Updated] (HIVE-21371) Make NonSyncByteArrayOutputStream Overflow Conscious
[ https://issues.apache.org/jira/browse/HIVE-21371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21371: --- Status: Patch Available (was: Open) > Make NonSyncByteArrayOutputStream Overflow Conscious > - > > Key: HIVE-21371 > URL: https://issues.apache.org/jira/browse/HIVE-21371 > Project: Hive > Issue Type: Improvement >Affects Versions: 4.0.0, 3.2.0 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Attachments: HIVE-21371.1.patch > > > {code:java|title=NonSyncByteArrayOutputStream} > private int enLargeBuffer(int increment) { > int temp = count + increment; > int newLen = temp; > if (temp > buf.length) { > if ((buf.length << 1) > temp) { > newLen = buf.length << 1; > } > byte newbuf[] = new byte[newLen]; > System.arraycopy(buf, 0, newbuf, 0, count); > buf = newbuf; > } > return newLen; > } > {code} > This will fail as the array approaches the 2GB limit on Java arrays: once > {{buf.length}} reaches 1GB, {{buf.length << 1}} overflows {{int}}, and > {{count + increment}} can overflow as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21371) Make NonSyncByteArrayOutputStream Overflow Conscious
[ https://issues.apache.org/jira/browse/HIVE-21371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21371: --- Attachment: HIVE-21371.1.patch > Make NonSyncByteArrayOutputStream Overflow Conscious > - > > Key: HIVE-21371 > URL: https://issues.apache.org/jira/browse/HIVE-21371 > Project: Hive > Issue Type: Improvement >Affects Versions: 4.0.0, 3.2.0 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Attachments: HIVE-21371.1.patch > > > {code:java|title=NonSyncByteArrayOutputStream} > private int enLargeBuffer(int increment) { > int temp = count + increment; > int newLen = temp; > if (temp > buf.length) { > if ((buf.length << 1) > temp) { > newLen = buf.length << 1; > } > byte newbuf[] = new byte[newLen]; > System.arraycopy(buf, 0, newbuf, 0, count); > buf = newbuf; > } > return newLen; > } > {code} > This will fail as the array approaches the 2GB limit on Java arrays: once > {{buf.length}} reaches 1GB, {{buf.length << 1}} overflows {{int}}, and > {{count + increment}} can overflow as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21370) JsonSerDe cannot handle json file with empty lines - Branch 3
[ https://issues.apache.org/jira/browse/HIVE-21370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21370: --- Description: Empty lines lead to duplicated records on output. > JsonSerDe cannot handle json file with empty lines - Branch 3 > - > > Key: HIVE-21370 > URL: https://issues.apache.org/jira/browse/HIVE-21370 > Project: Hive > Issue Type: Bug > Components: Serializers/Deserializers >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Critical > Fix For: 3.2.0 > > Attachments: HIVE-21370.1.patch > > > Empty lines lead to duplicated records on output. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-21370) JsonSerDe cannot handle json file with empty lines - Branch 3
[ https://issues.apache.org/jira/browse/HIVE-21370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16781992#comment-16781992 ] BELUGA BEHR commented on HIVE-21370: I'm not sure how to get it to run through YETUS against the 3.x branch. Any assistance would be appreciated. > JsonSerDe cannot handle json file with empty lines - Branch 3 > - > > Key: HIVE-21370 > URL: https://issues.apache.org/jira/browse/HIVE-21370 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Fix For: 3.2.0 > > Attachments: HIVE-21370.1.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21370) JsonSerDe cannot handle json file with empty lines - Branch 3
[ https://issues.apache.org/jira/browse/HIVE-21370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21370: --- Issue Type: Bug (was: Improvement) > JsonSerDe cannot handle json file with empty lines - Branch 3 > - > > Key: HIVE-21370 > URL: https://issues.apache.org/jira/browse/HIVE-21370 > Project: Hive > Issue Type: Bug > Components: Serializers/Deserializers >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Critical > Fix For: 3.2.0 > > Attachments: HIVE-21370.1.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21370) JsonSerDe cannot handle json file with empty lines - Branch 3
[ https://issues.apache.org/jira/browse/HIVE-21370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21370: --- Priority: Critical (was: Major) > JsonSerDe cannot handle json file with empty lines - Branch 3 > - > > Key: HIVE-21370 > URL: https://issues.apache.org/jira/browse/HIVE-21370 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Critical > Fix For: 3.2.0 > > Attachments: HIVE-21370.1.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21370) JsonSerDe cannot handle json file with empty lines - Branch 3
[ https://issues.apache.org/jira/browse/HIVE-21370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21370: --- Status: Patch Available (was: Open) > JsonSerDe cannot handle json file with empty lines - Branch 3 > - > > Key: HIVE-21370 > URL: https://issues.apache.org/jira/browse/HIVE-21370 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 3.1.0, 3.0.0, 3.2.0 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Fix For: 3.2.0 > > Attachments: HIVE-21370.1.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-21370) JsonSerDe cannot handle json file with empty lines - Branch 3
[ https://issues.apache.org/jira/browse/HIVE-21370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16781984#comment-16781984 ] BELUGA BEHR commented on HIVE-21370: This issue is already fixed in 4.0 with other changes. Just fixing the 3.x branch with this patch. > JsonSerDe cannot handle json file with empty lines - Branch 3 > - > > Key: HIVE-21370 > URL: https://issues.apache.org/jira/browse/HIVE-21370 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Fix For: 3.2.0 > > Attachments: HIVE-21370.1.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21370) JsonSerDe cannot handle json file with empty lines - Branch 3
[ https://issues.apache.org/jira/browse/HIVE-21370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21370: --- Attachment: HIVE-21370.1.patch > JsonSerDe cannot handle json file with empty lines - Branch 3 > - > > Key: HIVE-21370 > URL: https://issues.apache.org/jira/browse/HIVE-21370 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Fix For: 3.2.0 > > Attachments: HIVE-21370.1.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (HIVE-21370) JsonSerDe cannot handle json file with empty lines - Branch 3
[ https://issues.apache.org/jira/browse/HIVE-21370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR reassigned HIVE-21370: -- Assignee: BELUGA BEHR > JsonSerDe cannot handle json file with empty lines - Branch 3 > - > > Key: HIVE-21370 > URL: https://issues.apache.org/jira/browse/HIVE-21370 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Fix For: 3.2.0 > > Attachments: HIVE-21370.1.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21246) Un-bury DelimitedJSONSerDe from PlanUtils.java
[ https://issues.apache.org/jira/browse/HIVE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21246: --- Attachment: (was: HIVE-21246.1.patch) > Un-bury DelimitedJSONSerDe from PlanUtils.java > -- > > Key: HIVE-21246 > URL: https://issues.apache.org/jira/browse/HIVE-21246 > Project: Hive > Issue Type: Improvement >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Minor > Attachments: HIVE-21246.1.patch, HIVE-21246.1.patch > > > Ultimately, I'd like to get rid of > {{org.apache.hadoop.hive.serde2.DelimitedJSONSerDe}}, but for now, I'm trying to > make it easier to get rid of later. It's currently buried in > {{PlanUtils.java}}. > A SerDe and a boolean flag get passed into these methods. If the flag is > set to true, the specified SerDe is overwritten and assigned to > {{DelimitedJSONSerDe}}. This is not documented anywhere and it's a weird > thing to do. Just pass in the required SerDe from the start instead of > sending the wrong SerDe and a flag to overwrite it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-21354) Lock The Entire Table If Majority Of Partitions Are Locked
[ https://issues.apache.org/jira/browse/HIVE-21354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16781159#comment-16781159 ] BELUGA BEHR commented on HIVE-21354: Thanks for the input [~gopalv]. What about just a simple {{SELECT * FROM TABLE WHERE (non-partitioned-value)=?}} > Lock The Entire Table If Majority Of Partitions Are Locked > -- > > Key: HIVE-21354 > URL: https://issues.apache.org/jira/browse/HIVE-21354 > Project: Hive > Issue Type: Improvement > Components: HiveServer2 >Affects Versions: 4.0.0, 3.2.0 >Reporter: BELUGA BEHR >Priority: Major > > One of the bottlenecks of any Hive query is the ZooKeeper locking mechanism. > When a Hive query interacts with a table which has a lot of partitions, this > may put a lot of stress on the ZK system. > Please add a heuristic that works like this: > # Count the number of partitions that a query is required to lock > # Obtain the total number of partitions in the table > # If the number of partitions accessed by the query is greater than or equal > to half the total number of partitions, simply create one ZNode lock at the > table level. > This would improve performance of many queries, but in particular, a {{select > count(1) from table}} ... or ... {{select * from table limit 5}} where the > table has many partitions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
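The three-step heuristic proposed in HIVE-21354 can be sketched roughly as follows. This is only an illustration of the decision rule: {{LockGranularity}}, {{Level}}, and {{choose}} are hypothetical names, not Hive or ZooKeeper APIs, and the sketch says nothing about how the partition counts would be obtained.

```java
/** Hypothetical sketch of the proposed heuristic; none of these names exist in Hive. */
final class LockGranularity {

  enum Level { TABLE, PARTITION }

  /**
   * Chooses a single table-level lock when the query touches at least half
   * of the table's partitions, and per-partition locks otherwise.
   */
  static Level choose(int partitionsToLock, int totalPartitions) {
    if (partitionsToLock < 0 || totalPartitions < 0 || partitionsToLock > totalPartitions) {
      throw new IllegalArgumentException(
          "Invalid counts: " + partitionsToLock + " of " + totalPartitions);
    }
    // "Greater than or equal to half" is checked as 2 * p >= total, in long
    // arithmetic, to avoid both int overflow and integer-division rounding.
    if (2L * partitionsToLock >= totalPartitions) {
      return Level.TABLE;
    }
    return Level.PARTITION;
  }

  public static void main(String[] args) {
    System.out.println(choose(5, 10)); // TABLE
    System.out.println(choose(4, 10)); // PARTITION
    System.out.println(choose(3, 5));  // TABLE
  }
}
```

Comparing {{2 * p >= total}} instead of {{p >= total / 2}} matters for odd totals: with 5 partitions, integer division would treat 2 partitions as "half", which is not what the description asks for.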
[jira] [Commented] (HIVE-21354) Lock The Entire Table If Majority Of Partitions Are Locked
[ https://issues.apache.org/jira/browse/HIVE-21354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780821#comment-16780821 ] BELUGA BEHR commented on HIVE-21354: bq. This is only true when you disable ACID So does this only apply to ORC tables? Does ACIDv2 apply to Parquet, Avro, JSON, etc? > Lock The Entire Table If Majority Of Partitions Are Locked > -- > > Key: HIVE-21354 > URL: https://issues.apache.org/jira/browse/HIVE-21354 > Project: Hive > Issue Type: Improvement > Components: HiveServer2 >Affects Versions: 4.0.0, 3.2.0 >Reporter: BELUGA BEHR >Priority: Major > > One of the bottlenecks of any Hive query is the ZooKeeper locking mechanism. > When a Hive query interacts with a table which has a lot of partitions, this > may put a lot of stress on the ZK system. > Please add a heuristic that works like this: > # Count the number of partitions that a query is required to lock > # Obtain the total number of partitions in the table > # If the number of partitions accessed by the query is greater than or equal > to half the total number of partitions, simply create one ZNode lock at the > table level. > This would improve performance of many queries, but in particular, a {{select > count(1) from table}} ... or ... {{select * from table limit 5}} where the > table has many partitions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780694#comment-16780694 ] BELUGA BEHR commented on HIVE-21240: Hey Team, Any other comments, questions, concerns? > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.11.patch, > HIVE-21240.11.patch, HIVE-21240.11.patch, HIVE-21240.2.patch, > HIVE-21240.3.patch, HIVE-21240.4.patch, HIVE-21240.5.patch, > HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-21240.9.patch, > HIVE-24240.8.patch, kafka_storage_handler.diff > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
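The last bullet in the HIVE-21240 description (a cache for column-name to column-index searches) can be illustrated with a small sketch. This is not the code from the patch: {{ColumnIndexCache}} is a hypothetical class, and the case-insensitive matching is an assumption made for the example. It shows how building a map once per table replaces an O(n) scan per field with an expected O(1) lookup.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration of a name-to-index cache; not the HIVE-21240 patch. */
final class ColumnIndexCache {

  private final Map<String, Integer> indexByName = new HashMap<>();

  /** Builds the cache once, for example when the SerDe is initialized. */
  ColumnIndexCache(List<String> columnNames) {
    for (int i = 0; i < columnNames.size(); i++) {
      // putIfAbsent keeps the first index if a name repeats, like List.indexOf.
      // Assumption: names are matched case-insensitively.
      indexByName.putIfAbsent(columnNames.get(i).toLowerCase(Locale.ROOT), i);
    }
  }

  /** Returns the column index for a JSON field name, or -1 if there is none. */
  int indexOf(String fieldName) {
    Integer index = indexByName.get(fieldName.toLowerCase(Locale.ROOT));
    return index == null ? -1 : index;
  }

  public static void main(String[] args) {
    ColumnIndexCache cache = new ColumnIndexCache(Arrays.asList("id", "name", "ts"));
    System.out.println(cache.indexOf("name"));    // 1
    System.out.println(cache.indexOf("TS"));      // 2
    System.out.println(cache.indexOf("missing")); // -1
  }
}
```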
[jira] [Assigned] (HIVE-21356) Upgrade Jackson to 2.9.8
[ https://issues.apache.org/jira/browse/HIVE-21356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR reassigned HIVE-21356: -- Assignee: BELUGA BEHR > Upgrade Jackson to 2.9.8 > > > Key: HIVE-21356 > URL: https://issues.apache.org/jira/browse/HIVE-21356 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.2.0 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Minor > Fix For: 4.0.0 > > > Currently at: > {code} > 2.9.5 > {code} > Upgrade to 2.9.8 - contains some improvements for processing Base64 data. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21352) Drop INDEX from 3.0 Schema
[ https://issues.apache.org/jira/browse/HIVE-21352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21352: --- Description: We dropped support for Hive indexes starting in 3.0, however there are still tables in Metastore to support it. Please remove. https://github.com/apache/hive/blob/master/metastore/scripts/upgrade/mysql/hive-schema-2.3.0.mysql.sql#L147-L165 was:We dropped support for Hive indexes starting in 3.0, however there are still tables in Metastore to support it. Please remove. > Drop INDEX from 3.0 Schema > -- > > Key: HIVE-21352 > URL: https://issues.apache.org/jira/browse/HIVE-21352 > Project: Hive > Issue Type: Improvement > Components: Metastore >Affects Versions: 4.0.0, 3.2.0 >Reporter: BELUGA BEHR >Priority: Minor > > We dropped support for Hive indexes starting in 3.0, however there are still > tables in Metastore to support it. Please remove. > https://github.com/apache/hive/blob/master/metastore/scripts/upgrade/mysql/hive-schema-2.3.0.mysql.sql#L147-L165 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21347) Store Partition Count in TBLS
[ https://issues.apache.org/jira/browse/HIVE-21347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21347: --- Description: Please store a count of the number of partitions each table has in the {{TBLS}} table. This will allow very quick lookups for tables with many partitions. (was: Please store a count of the number of partitions each table has in the ```TBLS``` table. This will allow very quick lookups for tables with many partitions.) > Store Partition Count in TBLS > - > > Key: HIVE-21347 > URL: https://issues.apache.org/jira/browse/HIVE-21347 > Project: Hive > Issue Type: Improvement > Components: Metastore >Affects Versions: 4.0.0, 3.2.0 >Reporter: BELUGA BEHR >Priority: Major > > Please store a count of the number of partitions each table has in the > {{TBLS}} table. This will allow very quick lookups for tables with many > partitions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-21210) CombineHiveInputFormat Thread Pool Sizing
[ https://issues.apache.org/jira/browse/HIVE-21210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16779954#comment-16779954 ] BELUGA BEHR commented on HIVE-21210: [~zchovan] Can I get your thoughts on this one? :) > CombineHiveInputFormat Thread Pool Sizing > - > > Key: HIVE-21210 > URL: https://issues.apache.org/jira/browse/HIVE-21210 > Project: Hive > Issue Type: Improvement >Affects Versions: 4.0.0, 3.2.0 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Attachments: HIVE-21210.1.patch, HIVE-21210.2.patch, > HIVE-21210.3.patch, HIVE-21210.4.patch, HIVE-21210.5.patch, > HIVE-21210.6.patch, HIVE-21210.7.patch, HIVE-21210.8.patch > > > Threadpools. > Hive uses threadpools in several different places and each implementation is > a little different and requires different configurations. I think that Hive > needs to rein in and standardize the way that threadpools are used and > threadpools should scale automatically without manual configuration. At any > given time, there are many hundreds of threads running in the HS2 as the > number of simultaneous connections increases and they surely cause contention > with one another. > Here is an example: > {code:java|title=CombineHiveInputFormat.java} > // max number of threads we can use to check non-combinable paths > private static final int MAX_CHECK_NONCOMBINABLE_THREAD_NUM = 50; > private static final int DEFAULT_NUM_PATH_PER_THREAD = 100; > {code} > When building the splits for a MR job, there are up to 50 threads running per > query and there is not much scaling here; it's simply a 1 thread : 100 files > ratio. This implies that processing 5000 files takes 50 threads; beyond > that, 50 threads are still used. Many Hive jobs these days involve more than > 5000 files so it's not scaling well on bigger sizes. 
> This is not configurable (even manually), it doesn't change when the hardware > specs increase, and 50 threads seems like a lot when a service must support > up to 80 connections: > [https://www.cloudera.com/documentation/enterprise/5/latest/topics/admin_hive_tuning.html] > Not to mention, I have never seen a scenario where HS2 is running on a host > all by itself and has the entire system dedicated to it. Therefore it should > be more friendly and spin up fewer threads. > I am attaching a patch here that provides a few features: > * Common module that produces an {{ExecutorService}} which caps the number of > threads it spins up at the number of processors a host has. Keep in mind that > a class may submit as many work units ({{Callables}}) as it would like, but > the number of threads in the pool is capped. > * Common module for partitioning work. That is, allow for a generic > framework for dividing work into partitions (i.e. batches) > * Modify {{CombineHiveInputFormat}} to take advantage of both modules, > performing its same duties in a more Java OO way than is currently implemented > * Add a partitioning (batching) implementation that enforces partitioning of > a {{Collection}} based on the natural log of the {{Collection}} size so that > it scales more slowly than a simple 1:100 ratio. > * Simplify unit test code for {{CombineHiveInputFormat}} > My hope is to introduce these tools to {{CombineHiveInputFormat}} and then to > drop it into other places. One of the things I will introduce here is a > "direct thread" {{ExecutorService}} so that even if there is a configuration > for a thread pool to be disabled, it will still use an {{ExecutorService}} so > that the project can avoid logic like "if this function is serviced by a > thread pool, use an {{ExecutorService}} (and remember to close it later!) > otherwise, create a single thread" so that things like [HIVE-16949] can be > avoided in the future. Everything will just use an {{ExecutorService}}. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
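To make the two ideas in the HIVE-21210 description concrete (a pool capped at the processor count, and batching that grows with the natural log of the input size), here is a minimal sketch. It is not the attached patch: {{WorkPartitioning}}, {{newCappedPool}}, {{batchCount}}, and {{partition}} are hypothetical names, and the exact formula {{ceil(ln(size))}} is an assumption, since the description only says the batching is "based on the natural log".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch; names and the batching formula are assumptions. */
final class WorkPartitioning {

  /** Pool capped at the number of processors; callers may submit any number of tasks. */
  static ExecutorService newCappedPool() {
    return Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
  }

  /** Number of batches grows with ln(size) rather than linearly with size. */
  static int batchCount(int size) {
    if (size <= 0) {
      return 0;
    }
    return Math.max(1, (int) Math.ceil(Math.log(size)));
  }

  /** Splits items into at most batchCount(items.size()) contiguous batches. */
  static <T> List<List<T>> partition(List<T> items) {
    int batches = batchCount(items.size());
    List<List<T>> result = new ArrayList<>(batches);
    if (batches == 0) {
      return result;
    }
    int batchSize = (items.size() + batches - 1) / batches; // ceiling division
    for (int start = 0; start < items.size(); start += batchSize) {
      result.add(items.subList(start, Math.min(start + batchSize, items.size())));
    }
    return result;
  }

  public static void main(String[] args) throws Exception {
    List<Integer> files = new ArrayList<>();
    for (int i = 0; i < 5000; i++) {
      files.add(i);
    }
    ExecutorService pool = newCappedPool();
    try {
      List<Callable<Integer>> tasks = new ArrayList<>();
      for (List<Integer> batch : partition(files)) {
        tasks.add(batch::size); // stand-in for real per-batch work
      }
      int processed = 0;
      for (Future<Integer> future : pool.invokeAll(tasks)) {
        processed += future.get();
      }
      System.out.println(processed); // 5000
    } finally {
      pool.shutdown(); // always release the pool's threads
    }
  }
}
```

With this assumed formula, 5000 files end up in 9 batches instead of 50, and the number of live threads never exceeds the processor count regardless of how many batches are submitted.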
[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16779501#comment-16779501 ] BELUGA BEHR commented on HIVE-21240: Just to be clear... Avro already uses a {{List}} as the return type, so I'm just bringing JsonSerde into alignment with the rest. https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/avro/AvroDeserializer.java#L143 > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.11.patch, > HIVE-21240.11.patch, HIVE-21240.11.patch, HIVE-21240.2.patch, > HIVE-21240.3.patch, HIVE-21240.4.patch, HIVE-21240.5.patch, > HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-21240.9.patch, > HIVE-24240.8.patch, kafka_storage_handler.diff > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16779499#comment-16779499 ] BELUGA BEHR commented on HIVE-21240: [~bslim] With a large project like Hive, maintained by many different supporters and countless number of additional troubleshooters that dig through the code to resolve issues, it is all the more important to adhere to best practices. With few exceptions, everything should be a Java Collection. Making smart choices about the actual data structures used (Set, Map, List, etc.) is going to yield much more benefit than trying to manipulate primitive arrays. I've never had a Hive user complain that they wished it was 2% faster, but I hear all the time about how complicated the product is and how difficult it is to troubleshoot. There are a few books written on the topic which I won't regurgitate here, but I think this sums it up well: https://stackoverflow.com/questions/6100148/collection-interface-vs-arrays > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.11.patch, > HIVE-21240.11.patch, HIVE-21240.11.patch, HIVE-21240.2.patch, > HIVE-21240.3.patch, HIVE-21240.4.patch, HIVE-21240.5.patch, > HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-21240.9.patch, > HIVE-24240.8.patch, kafka_storage_handler.diff > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. 
> * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778846#comment-16778846 ] BELUGA BEHR commented on HIVE-21240: All unit tests are passing [~bslim] [~kgyrtkirk]. Please consider this patch for inclusion into the project. I understand there is some hesitation regarding the change in return type. Previously a native array was returned and now a Collection (List) is returned by the SerDe. I think it's better to work with Java Collections instead of native arrays and if we're going to change the return value at all, this is an appropriate time to introduce such a change, i.e., in a major (4.0) release. > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.11.patch, > HIVE-21240.11.patch, HIVE-21240.11.patch, HIVE-21240.2.patch, > HIVE-21240.3.patch, HIVE-21240.4.patch, HIVE-21240.5.patch, > HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-21240.9.patch, > HIVE-24240.8.patch, kafka_storage_handler.diff > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778846#comment-16778846 ] BELUGA BEHR edited comment on HIVE-21240 at 2/27/19 3:44 AM: - All unit tests are passing [~bslim] [~kgyrtkirk]. Please consider this patch for inclusion into the project. I understand there is some hesitation regarding the change in return type. Previously, a native array was returned and now (with this patch) a Collection (List) is returned by the SerDe. I think it's better to work with Java Collections instead of native arrays and if we're going to change the return value, this is an appropriate time to introduce such a change, i.e., in a major (4.0) release. was (Author: belugabehr): All unit tests are passing [~bslim] [~kgyrtkirk]. Please consider this patch for inclusion into the project. I understand there is some hesitation regarding the change in return type. Previously, a native array was returned and now (with this patch) a Collection (List) is returned by the SerDe. I think it's better to work with Java Collections instead of native arrays and if we're going to change the return value at all, this is an appropriate time to introduce such a change, i.e., in a major (4.0) release. 
[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778846#comment-16778846 ] BELUGA BEHR edited comment on HIVE-21240 at 2/27/19 3:44 AM: - All unit tests are passing [~bslim] [~kgyrtkirk]. Please consider this patch for inclusion into the project. I understand there is some hesitation regarding the change in return type. Previously, a native array was returned and now (with this patch) a Collection (List) is returned by the SerDe. I think it's better to work with Java Collections instead of native arrays and if we're going to change the return value at all, this is an appropriate time to introduce such a change, i.e., in a major (4.0) release. was (Author: belugabehr): All unit tests are passing [~bslim] [~kgyrtkirk]. Please consider this patch for inclusion into the project. I understand there is some hesitation regarding the change in return type. Previously, a native array was returned and now a Collection (List) is returned by the SerDe. I think it's better to work with Java Collections instead of native arrays and if we're going to change the return value at all, this is an appropriate time to introduce such a change, i.e., in a major (4.0) release. 
[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21240: --- Attachment: HIVE-21240.11.patch
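The HIVE-21240 description mentions adding a cache for column-name to column-index searches, which are otherwise O(n) for each column of each row. A minimal Java sketch of that idea follows, assuming a map built once per table schema. The class ColumnIndexCache and its methods are hypothetical names, and the case-insensitive matching is an assumption made for the example; none of this is taken from the attached patch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; names and behavior are assumptions.
class ColumnIndexCache {
  private final Map<String, Integer> indexByName = new HashMap<>();

  /** Builds the name-to-index map once, in O(n) over the column list. */
  ColumnIndexCache(List<String> columnNames) {
    for (int i = 0; i < columnNames.size(); i++) {
      // Lower-casing is an assumption; on duplicate names the first index wins.
      indexByName.putIfAbsent(columnNames.get(i).toLowerCase(Locale.ROOT), i);
    }
  }

  /** Looks up a field name in expected O(1); returns -1 if it is unknown. */
  int indexOf(String fieldName) {
    Integer index = indexByName.get(fieldName.toLowerCase(Locale.ROOT));
    return index == null ? -1 : index;
  }

  public static void main(String[] args) {
    ColumnIndexCache cache = new ColumnIndexCache(Arrays.asList("id", "name", "ts"));
    System.out.println(cache.indexOf("name"));
    System.out.println(cache.indexOf("missing"));
  }
}
```

Compared with calling List.indexOf for every field of every row, the map is built once and each later lookup avoids a linear scan.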
[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21240: --- Status: Patch Available (was: Open)
[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21240: --- Status: Open (was: Patch Available)
[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778489#comment-16778489 ] BELUGA BEHR commented on HIVE-21240: [~bslim] Can you drop the test for {{kafka_table_2}} since it is no longer testing the 'basic implementation' as is described?
[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21240: --- Status: Patch Available (was: Open)
[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21240: --- Attachment: HIVE-21240.11.patch
[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21240: --- Status: Open (was: Patch Available)
[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21240: --- Status: Patch Available (was: Open)
[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21240: --- Status: Open (was: Patch Available)
[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21240: --- Attachment: HIVE-21240.11.patch
[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21240: --- Attachment: HIVE-21240.11.patch
[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21240: --- Status: Patch Available (was: Open) Posting a patch with all the changes (serde and kafka) and we'll see what we get.
[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21240: --- Status: Open (was: Patch Available)
[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777516#comment-16777516 ] BELUGA BEHR edited comment on HIVE-21240 at 2/26/19 2:54 AM: - [~bslim] Thanks for the update. Here is the diff I'm looking at: [^kafka_storage_handler.diff] To pass the test with this diff, it requires that you use the {{JsonSerDe}} on my local branch which fixes the {{timestamp with local timezone}} stuff. As you can see, I have populated the values with the timestamp values. Are you expecting all values to be lost (null)? Regarding {{KafkaJsonSerDe}}, if you wish to keep it around, I recommend we move it to the 'test' directory so that it's not shipping with the actual product. If it's not meant for production, we don't want to make it available, because there's always that one person that will find it and use it. However, the Hive {{JsonSerDe}} is already the default in the Kafka project, so what is the level of effort (LOE) to use the one included with Hive rather than this test implementation? was (Author: belugabehr): [~bslim] Thanks for the update. Here is the diff I'm looking at: [^kafka_storage_handler.diff] To pass the test with this diff, it requires that you use the {{JsonSerDe}} on my local branch which fixes the {{timestamp with local timezone}} stuff. As you can see, I have populated the values with the timestamp values. Are you expecting all values to be lost? Regarding {{KafkaJsonSerDe}}, if you wish to keep it around, I recommend we move it to the 'test' directory so that it's not shipping with the actual product. If it's not meant for production, we don't want to make it available, because there's always that one person that will find it and use it. However, the Hive {{JsonSerDe}} is already the default in the Kafka project, so what is the level of effort (LOE) to use the one included with Hive rather than this test implementation? 
[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777516#comment-16777516 ] BELUGA BEHR edited comment on HIVE-21240 at 2/26/19 2:53 AM: - [~bslim] Thanks for the update. Here is the diff I'm looking at: [^kafka_storage_handler.diff] To pass the test with this diff, it requires that you use the {{JsonSerDe}} on my local branch which fixes the {{timestamp with local timezone}} stuff. As you can see, I have populated the values with the timestamp values? Are you expecting all values to be lost? Regarding {{KafkaJsonSerDe}}, if you wish to keep it around, I recommend we move it to the 'test' directory so that it's not shipping with the actual product. If it's not meant for production, we don't want to make it available, because there's always that one person that will find it and use it. However, the Hive {{JsonSerDe}} is already the default in the Kafka project, so what is the level of effort (LOE) to use the one included with Hive rather than this test implementation? was (Author: belugabehr): [~bslim] Thanks for the update. Here is the diff I'm looking at: [^kafka_storage_handler.diff] To pass the test with this diff, it requires that you use the work for {{JsonSerDe}} on my local branch. As you can see, I have populated the values with the timestamp values? Are you expecting all values to be lost? Regarding {{KafkaJsonSerDe}}, if you wish to keep it around, I recommend we move it to the 'test' directory so that it's not shipping with the actual product. If it's not meant for production, we don't want to make it available, because there's always that one person that will find it and use it. However, the Hive {{JsonSerDe}} is already the default in the Kafka project, so what is the level of effort (LOE) to use the one included with Hive rather than this test implementation? 
[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777516#comment-16777516 ] BELUGA BEHR edited comment on HIVE-21240 at 2/26/19 2:53 AM: - [~bslim] Thanks for the update. Here is the diff I'm looking at: [^kafka_storage_handler.diff] To pass the test with this diff, it requires that you use the {{JsonSerDe}} on my local branch which fixes the {{timestamp with local timezone}} stuff. As you can see, I have populated the values with the timestamp values. Are you expecting all values to be lost? Regarding {{KafkaJsonSerDe}}, if you wish to keep it around, I recommend we move it to the 'test' directory so that it's not shipping with the actual product. If it's not meant for production, we don't want to make it available, because there's always that one person that will find it and use it. However, the Hive {{JsonSerDe}} is already the default in the Kafka project, so what is the level of effort (LOE) to use the one included with Hive rather than this test implementation? was (Author: belugabehr): [~bslim] Thanks for the update. Here is the diff I'm looking at: [^kafka_storage_handler.diff] To pass the test with this diff, it requires that you use the {{JsonSerDe}} on my local branch which fixes the {{timestamp with local timezone}} stuff. As you can see, I have populated the values with the timestamp values? Are you expecting all values to be lost? Regarding {{KafkaJsonSerDe}}, if you wish to keep it around, I recommend we move it to the 'test' directory so that it's not shipping with the actual product. If it's not meant for production, we don't want to make it available, because there's always that one person that will find it and use it. However, the Hive {{JsonSerDe}} is already the default in the Kafka project, so what is the level of effort (LOE) to use the one included with Hive rather than this test implementation? 
> JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, > HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, > HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch, > kafka_storage_handler.diff > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777516#comment-16777516 ] BELUGA BEHR commented on HIVE-21240: [~bslim] Thanks for the update. Here is the diff I'm looking at: [^kafka_storage_handler.diff] To pass the test with this diff, it requires that you use the work for {{JsonSerDe}} on my local branch. As you can see, I have populated the values with the timestamp values. Are you expecting all values to be lost? Regarding {{KafkaJsonSerDe}}, if you wish to keep it around, I recommend we move it to the 'test' directory so that it's not shipping with the actual product. If it's not meant for production, we don't want to make it available, because there's always that one person that will find it and use it. However, the Hive {{JsonSerde}} is already the default in the Kafka project, so what is the LOE to use the one included with Hive rather than this test implementation? > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, > HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, > HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch, > kafka_storage_handler.diff > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. 
> * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21240: --- Attachment: kafka_storage_handler.diff > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, > HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, > HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch, > kafka_storage_handler.diff > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777446#comment-16777446 ]

BELUGA BEHR commented on HIVE-21240:

[~kgyrtkirk] OK. I finally figured it out. cc: [~bslim]

There are a few things going on:

The Kafka Storage Handler q-test is incorrect. Several tables define a {{timestamp}} or a {{timestamp with local time zone}} column, and in every q-test result the test expects a {{NULL}} value in that column. I believe this is incorrect, and the q-test is therefore testing for the wrong behavior. There is timestamp data in the test data set, so values should be present, not {{NULL}} values.

I believe that the current JSON SerDe is unable to handle the timestamp Strings in the test data set and, instead of throwing an Exception, it swallows the bad value and returns a null value. Check it out [here|https://github.com/apache/hive/blob/f37c5de6c32b9395d1b34fa3c02ed06d1bfbf6eb/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils.java#L1298]. I do not think this is good behavior. Users may silently lose data and there is no way to configure it otherwise. Even if it were configurable, the default should be to throw an Exception and stop processing, not to lose data.

Additionally, there is a q-test that states:

{quote}using basic implementation of flat json probably to be removed {quote}

This is achieved by not specifying an explicit table SerDe in the q-test. The 'basic implementation' is probably a reference to the class {{KafkaJsonSerDe}}. However, as far as I can tell, this class is not used anymore and is no longer the default SerDe for the Kafka Storage Handler. The Hive {{JsonSerde}} is [the default serde|https://github.com/apache/hive/blob/master/kafka-handler/src/java/org/apache/hadoop/hive/kafka/KafkaTableProperties.java#L38].

One thing you'll notice with this particular q-test is that there is also no {{timestamp.formats}} defined for the table. This is because {{KafkaJsonSerDe}} handles [ISO-8601 format|https://github.com/apache/hive/blob/master/kafka-handler/src/java/org/apache/hadoop/hive/kafka/KafkaJsonSerDe.java#L230] (and only that format), so the table does not need to specify the format explicitly. All of the data in this test is formatted as ISO-8601, and all the other tables in the test pass in an ISO-8601 format string. Passing this format string is required because the Hive {{JsonSerde}} does not handle that format by default. Without {{timestamp.formats}} defined on this table, there is no way that the current {{JsonSerde}} is handling this data correctly. It again demonstrates that the current {{JsonSerde}} behavior is to swallow the exception and return NULL.

[https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/kafka_storage_handler.q]
[https://github.com/apache/hive/blob/master/ql/src/test/results/clientpositive/kafka/kafka_storage_handler.q.out]

What triggered these q-test failures for this JIRA is threefold:
# This JIRA's implementation was in some cases producing valid timestamps instead of NULL
# This JIRA's implementation was throwing an exception because it was unable to parse the timestamp String without the format string defined
# This implementation was in some cases throwing an exception because it did not support {{timestamp with local time zone}}.

Moving forward, I would like to:
# Add to this implementation the ability to process {{timestamp with local time zone}}. This is not fully supported in the current {{JsonSerde}} implementation. Only [timestamp|https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/json/HiveJsonStructReader.java#L355] is supported. It works in trunk because a conversion process that happens immediately prior to this _switch_ statement is doing the work.
# Update the {{kafka_storage_handler}} q-test to check for the correct timestamp values
# Remove the {{KafkaJsonSerDe}} serde
# Remove the "basic implementation of flat json" test from the q-test

> JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, > HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, > HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch > >
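The difference between swallowing an unparseable timestamp and failing loudly, which the HIVE-21240 comment above argues about, can be illustrated with plain `java.time`. This is a sketch only: `TimestampParseSketch`, the sample value, and the `yyyy-MM-dd HH:mm:ss` pattern are assumptions chosen for illustration and are not Hive's SerDe code or the q-test data.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

/** Hypothetical illustration; not part of Hive. */
final class TimestampParseSketch {

  /** Lenient variant: a parse failure is swallowed and surfaces only as null. */
  static LocalDateTime parseOrNull(String text, DateTimeFormatter format) {
    try {
      return LocalDateTime.parse(text, format);
    } catch (DateTimeParseException e) {
      return null; // the bad value silently disappears
    }
  }

  /** Strict variant: a parse failure propagates to the caller and stops processing. */
  static LocalDateTime parseStrict(String text, DateTimeFormatter format) {
    return LocalDateTime.parse(text, format);
  }

  public static void main(String[] args) {
    // Assumed stand-in for a default pattern that is not ISO-8601.
    DateTimeFormatter plain = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
    String iso = "2013-08-31T01:02:33"; // made-up ISO-8601 sample value

    // With the wrong format, the lenient path loses the value without any signal.
    System.out.println(parseOrNull(iso, plain));

    // With the matching format, the value is recovered.
    System.out.println(parseStrict(iso, DateTimeFormatter.ISO_LOCAL_DATE_TIME));

    // With the wrong format, the strict path reports the problem instead.
    try {
      parseStrict(iso, plain);
    } catch (DateTimeParseException e) {
      System.out.println("rejected: " + e.getParsedString());
    }
  }
}
```

A test that only ever sees the lenient path cannot tell a genuinely missing value from a value that failed to parse, which is the situation the comment describes for the q-test's expected NULLs.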
[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777129#comment-16777129 ] BELUGA BEHR commented on HIVE-21240: [~kgyrtkirk] Figured out my local failure for this UT. Will investigate further this failure. > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, > HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, > HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21317) Unit Test kafka_storage_handler Is Failing Regularly
[ https://issues.apache.org/jira/browse/HIVE-21317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21317: --- Priority: Minor (was: Critical) > Unit Test kafka_storage_handler Is Failing Regularly > > > Key: HIVE-21317 > URL: https://issues.apache.org/jira/browse/HIVE-21317 > Project: Hive > Issue Type: Task >Affects Versions: 4.0.0 >Reporter: BELUGA BEHR >Priority: Minor > > {code} > org.apache.hadoop.hive.cli.TestMiniHiveKafkaCliDriver.testCliDriver[kafka_storage_handler] > (batchId=275) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (HIVE-21317) Unit Test kafka_storage_handler Is Failing Regularly
[ https://issues.apache.org/jira/browse/HIVE-21317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR resolved HIVE-21317. Resolution: Not A Problem > Unit Test kafka_storage_handler Is Failing Regularly > > > Key: HIVE-21317 > URL: https://issues.apache.org/jira/browse/HIVE-21317 > Project: Hive > Issue Type: Task >Affects Versions: 4.0.0 >Reporter: BELUGA BEHR >Priority: Critical > > {code} > org.apache.hadoop.hive.cli.TestMiniHiveKafkaCliDriver.testCliDriver[kafka_storage_handler] > (batchId=275) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777102#comment-16777102 ] BELUGA BEHR edited comment on HIVE-21240 at 2/25/19 5:24 PM: - [~kgyrtkirk] Thanks! #1 I'm not sure I understand the first request. Are you talking specifically about the HCat code? Are there missing unit tests here? Is that why it passes even though the data types have been changed? As I see it, the native arrays are all transformed into Java Collections: {code:java|title=HCat JsonSerDe} List fatRow = fatLand((Object[]) row); return new DefaultHCatRecord(fatRow); ... return Arrays.asList(ArrayUtils.toObject((int[]) arr)); {code} So, the JSON SerDe should just create Java Collections from the get-go instead of having to transform it later. #2 I noted that the Kafka_Handler Q-Test fails locally on trunk as well. I searched across JIRA and see this test fails across many places. I'm not suggesting that there be an exception to the "all green" policy, simply that I need help with investigating the cause as I believe it is outside the scope of this one JIRA. #3 I don't think there's much value in going back and changing the code and testing it. These proposed changes are not about making the SerDe faster; I just want to put it out there that there isn't a huge regression. If it's a bit quicker, then that's an added bonus. was (Author: belugabehr): [~kgyrtkirk] Thanks! #1 I'm not sure I understand the first request. Are you talking specifically about the HCat code? Are there missing unit tests here? Is that why it passes even though the data types have been changed? As I see it, the native arrays are all transformed into Java Collections: {code:java|title=HCat JsonSerDe} List fatRow = fatLand((Object[]) row); return new DefaultHCatRecord(fatRow); ... return Arrays.asList(ArrayUtils.toObject((int[]) arr)); {code} So, the JSON SerDe should just create Java Collections from the get-go instead of having to transform it later. 
#2 I noted that the Kafka_Handler Q-Test fails locally on trunk as well. I searched across JIRA and see this test fails across many places. I can keep looking at it though. #3 I don't think there's much value in going back and changing the code and testing it. These proposed changes are not about making the SerDe faster; I just want to put it out there that there isn't a huge regression. If it's a bit quicker, then that's an added bonus. > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, > HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, > HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777102#comment-16777102 ] BELUGA BEHR commented on HIVE-21240: [~kgyrtkirk] Thanks! #1 I'm not sure I understand the first request. Are you talking specifically about the HCat code? Are there missing unit tests here? Is that why it passes even though the data types have been changed? As I see it, the native arrays are all transformed into Java Collections: {code:java|title=HCat JsonSerDe} List fatRow = fatLand((Object[]) row); return new DefaultHCatRecord(fatRow); ... return Arrays.asList(ArrayUtils.toObject((int[]) arr)); {code} So, the JSON SerDe should just create Java Collections from the get-go instead of having to transform it later. #2 I noted that the Kafka_Handler Q-Test fails locally on trunk as well. I searched across JIRA and see this test fails across many places. I can keep looking at it though. #3 I don't think there's much value in going back and changing the code and testing it. These proposed changes are not about making the SerDe faster; I just want to put it out there that there isn't a huge regression. If it's a bit quicker, then that's an added bonus. > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, > HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, > HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. 
> * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
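The point in item #1 of the comment above, that a primitive array has to be boxed before it can become a Java Collection, can be sketched with the standard library alone. The class name `PrimitiveArrayToListSketch` is hypothetical; the quoted HCat snippet uses `ArrayUtils.toObject` from Apache Commons Lang, whereas this sketch swaps in the JDK streams API for the same conversion.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical illustration; not the HCat JsonSerDe code. */
final class PrimitiveArrayToListSketch {

  /** Boxes each int and collects the values into a List<Integer>. */
  static List<Integer> toList(int[] values) {
    return Arrays.stream(values).boxed().collect(Collectors.toList());
  }

  public static void main(String[] args) {
    int[] primitives = {1, 2, 3};

    // Pitfall: Arrays.asList on an int[] yields a one-element List<int[]>,
    // which is why an explicit boxing step is needed.
    System.out.println(Arrays.asList(primitives).size());

    // Boxing first gives a three-element List<Integer>.
    System.out.println(toList(primitives));
  }
}
```

Building the collection once, when the value is first deserialized, avoids this second pass over every array later on, which is the design preference the comment expresses.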
[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777012#comment-16777012 ] BELUGA BEHR commented on HIVE-21240: [~kgyrtkirk] I created [HIVE-21317] to address the one failing unit test. Can you please review the latest patch? Thanks! > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, > HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, > HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776989#comment-16776989 ] BELUGA BEHR commented on HIVE-20079: [~asinkovits] Thanks for the clarification. I see now that the current implementation just counts columns. I'm on the same page now. Thanks. > Populate more accurate rawDataSize for parquet format > - > > Key: HIVE-20079 > URL: https://issues.apache.org/jira/browse/HIVE-20079 > Project: Hive > Issue Type: Improvement > Components: File Formats >Affects Versions: 2.0.0 >Reporter: Aihua Xu >Assignee: Antal Sinkovits >Priority: Major > Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, > HIVE-20079.3.patch, HIVE-20079.4.patch, HIVE-20079.5.patch, HIVE-20079.6.patch > > > Run the following queries and you will see the raw data for the table is 4 > (that is the number of fields) incorrectly. We need to populate correct data > size so data can be split properly. > {noformat} > SET hive.stats.autogather=true; > CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET; > INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1'); > DESC FORMATTED parquet_stats; > {noformat} > {noformat} > Table Parameters: > COLUMN_STATS_ACCURATE true > numFiles1 > numRows 2 > rawDataSize 4 > totalSize 373 > transient_lastDdlTime 1530660523 > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21303) Update TextRecordReader
[ https://issues.apache.org/jira/browse/HIVE-21303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21303: --- Status: Patch Available (was: Open) Thanks [~pvary]. So, what I realized just now is that this class is only used for unit tests. I propose moving this class out of the Hive main code base and into test. Patch included. > Update TextRecordReader > --- > > Key: HIVE-21303 > URL: https://issues.apache.org/jira/browse/HIVE-21303 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Affects Versions: 4.0.0, 3.2.0 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Attachments: HIVE-21303.1.patch, HIVE-21303.2.patch > > > Remove use of Deprecated > {{org.apache.hadoop.mapred.LineRecordReader.LineReader}} > For every call to {{next}}, the code dives into the configuration map to see > if this feature is enabled. Just look it up once and cache the value. > {code:java} > public int next(Writable row) throws IOException { > ... > if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVESCRIPTESCAPE)) { > return HiveUtils.unescapeText((Text) row); > } > return bytesConsumed; > } > {code} > Other clean up. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
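The "look it up once and cache the value" suggestion in the HIVE-21303 description can be sketched as below. This is a minimal illustration, not the attached patch: `CachedFlagReaderSketch`, `CountingConf`, the key `example.script.escape`, and the simplified unescape step are all hypothetical stand-ins for `TextRecordReader`, `HiveConf`, `HIVESCRIPTESCAPE`, and `HiveUtils.unescapeText`.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical reader that resolves its configuration flag once, at construction. */
final class CachedFlagReaderSketch {
  static final String ESCAPE_KEY = "example.script.escape"; // made-up key for the sketch

  private final boolean escapeEnabled; // cached result of the single lookup

  CachedFlagReaderSketch(CountingConf conf) {
    this.escapeEnabled = conf.getBoolean(ESCAPE_KEY, false);
  }

  /** Per-row work consults the cached field, never the configuration map. */
  String next(String row) {
    // Simplified stand-in for unescaping: turn a literal backslash-t into a tab.
    return escapeEnabled ? row.replace("\\t", "\t") : row;
  }

  public static void main(String[] args) {
    CountingConf conf = new CountingConf();
    conf.set(ESCAPE_KEY, "true");
    CachedFlagReaderSketch reader = new CachedFlagReaderSketch(conf);
    for (int i = 0; i < 1000; i++) {
      reader.next("a\\tb");
    }
    System.out.println("configuration lookups: " + conf.lookups);
  }
}

/** Hypothetical configuration object that counts how often it is queried. */
final class CountingConf {
  private final Map<String, String> values = new HashMap<>();
  int lookups = 0;

  void set(String key, String value) {
    values.put(key, value);
  }

  boolean getBoolean(String key, boolean defaultValue) {
    lookups++;
    String value = values.get(key);
    return value == null ? defaultValue : Boolean.parseBoolean(value);
  }
}
```

The trade-off is that a flag changed after the reader is constructed is not picked up, which is usually acceptable for a setting that is fixed for the lifetime of a task.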
[jira] [Updated] (HIVE-21303) Update TextRecordReader
[ https://issues.apache.org/jira/browse/HIVE-21303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21303: --- Attachment: HIVE-21303.2.patch > Update TextRecordReader > --- > > Key: HIVE-21303 > URL: https://issues.apache.org/jira/browse/HIVE-21303 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Affects Versions: 4.0.0, 3.2.0 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Attachments: HIVE-21303.1.patch, HIVE-21303.2.patch > > > Remove use of Deprecated > {{org.apache.hadoop.mapred.LineRecordReader.LineReader}} > For every call to {{next}}, the code dives into the configuration map to see > if this feature is enabled. Just look it up once and cache the value. > {code:java} > public int next(Writable row) throws IOException { > ... > if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVESCRIPTESCAPE)) { > return HiveUtils.unescapeText((Text) row); > } > return bytesConsumed; > } > {code} > Other clean up. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21303) Update TextRecordReader
[ https://issues.apache.org/jira/browse/HIVE-21303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21303: --- Status: Open (was: Patch Available) > Update TextRecordReader > --- > > Key: HIVE-21303 > URL: https://issues.apache.org/jira/browse/HIVE-21303 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Affects Versions: 4.0.0, 3.2.0 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Attachments: HIVE-21303.1.patch > > > Remove use of Deprecated > {{org.apache.hadoop.mapred.LineRecordReader.LineReader}} > For every call to {{next}}, the code dives into the configuration map to see > if this feature is enabled. Just look it up once and cache the value. > {code:java} > public int next(Writable row) throws IOException { > ... > if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVESCRIPTESCAPE)) { > return HiveUtils.unescapeText((Text) row); > } > return bytesConsumed; > } > {code} > Other clean up. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-20079) Populate more accurate rawDataSize for parquet format
[ https://issues.apache.org/jira/browse/HIVE-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776925#comment-16776925 ] BELUGA BEHR commented on HIVE-20079: This patch is still incorrect. It's actually producing the same wrong numbers as before, though perhaps a bit more efficiently. {code} totalSize += block.getTotalByteSize(); {code} {{getTotalByteSize()}} is not the same as "rawDataSize". bq. rawDataSize—Approximate size of data in memory https://www.cloudera.com/documentation/enterprise/5-15-x/topics/admin_hos_tuning.html That means that for a single table row with 4 INTs (values: 1,2,3,4) I would expect a rawDataSize of (4 bytes x 4 Java ints) = 16 bytes. However, Parquet would report this as 4 bytes because of the way that Parquet packs these numbers internally. Hive should take the row count and multiply it by the in-memory size of the row's data types. The {{AbstractSerDe}} class should have code to facilitate all of this, such as {{readNumber()}}, {{readString(int numBytes)}}, etc., that can be called as each row is read. > Populate more accurate rawDataSize for parquet format > - > > Key: HIVE-20079 > URL: https://issues.apache.org/jira/browse/HIVE-20079 > Project: Hive > Issue Type: Improvement > Components: File Formats >Affects Versions: 2.0.0 >Reporter: Aihua Xu >Assignee: Antal Sinkovits >Priority: Major > Attachments: HIVE-20079.1.patch, HIVE-20079.2.patch, > HIVE-20079.3.patch, HIVE-20079.4.patch, HIVE-20079.5.patch, HIVE-20079.6.patch > > > Run the following queries and you will see the raw data for the table is 4 > (that is the number of fields) incorrectly. We need to populate correct data > size so data can be split properly. 
> {noformat} > SET hive.stats.autogather=true; > CREATE TABLE parquet_stats (id int,str string) STORED AS PARQUET; > INSERT INTO parquet_stats values(0, 'this is string 0'), (1, 'string 1'); > DESC FORMATTED parquet_stats; > {noformat} > {noformat} > Table Parameters: > COLUMN_STATS_ACCURATE true > numFiles1 > numRows 2 > rawDataSize 4 > totalSize 373 > transient_lastDdlTime 1530660523 > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
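The estimate proposed in the HIVE-20079 comment above (row count multiplied by the in-memory width of each column type) can be sketched for fixed-width columns as follows. `RawDataSizeSketch` and its `FixedWidthType` enum are hypothetical names for this illustration; variable-length types such as strings would need per-value lengths and are deliberately left out.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical estimator; not Hive's statistics code. */
final class RawDataSizeSketch {

  /** A few fixed-width column types with their Java in-memory widths in bytes. */
  enum FixedWidthType {
    INT(Integer.BYTES),
    BIGINT(Long.BYTES),
    DOUBLE(Double.BYTES);

    final int bytes;

    FixedWidthType(int bytes) {
      this.bytes = bytes;
    }
  }

  /** Approximate in-memory size: rows multiplied by the summed column widths. */
  static long estimate(long rowCount, List<FixedWidthType> schema) {
    long bytesPerRow = 0;
    for (FixedWidthType type : schema) {
      bytesPerRow += type.bytes;
    }
    // multiplyExact fails loudly on overflow instead of wrapping around.
    return Math.multiplyExact(rowCount, bytesPerRow);
  }

  public static void main(String[] args) {
    List<FixedWidthType> fourInts = Arrays.asList(
        FixedWidthType.INT, FixedWidthType.INT, FixedWidthType.INT, FixedWidthType.INT);
    // One row of four 4-byte ints: 4 x 4 bytes.
    System.out.println(estimate(1, fourInts));
  }
}
```

This figure is independent of how compactly a file format encodes the values on disk, which is the distinction the comment draws between an in-memory estimate and Parquet's reported byte size.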
[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21240: --- Status: Patch Available (was: Open) > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 3.1.1, 4.0.0 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, > HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, > HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21240: --- Attachment: HIVE-21240.10.patch > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, > HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, > HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21240: --- Status: Open (was: Patch Available) > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 3.1.1, 4.0.0 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.10.patch, HIVE-21240.2.patch, HIVE-21240.3.patch, > HIVE-21240.4.patch, HIVE-21240.5.patch, HIVE-21240.6.patch, > HIVE-21240.7.patch, HIVE-21240.9.patch, HIVE-24240.8.patch > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-15475) JsonSerDe cannot handle json file with empty lines
[ https://issues.apache.org/jira/browse/HIVE-15475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776298#comment-16776298 ] BELUGA BEHR commented on HIVE-15475: Nope. OK. Figured it out. This issue was inadvertently fixed as part of [HIVE-18545] (Jul 10, 2018). Before that change, JSON was handled by {{org.apache.hive.hcatalog.data.JsonSerDe}}. The issue was that this class did not handle the provided {{Text}} object correctly. A {{Text}} object has two components: an internal byte array *and* a size that indicates how many of those bytes are to be processed. {{JsonSerDe}} did not take the size into account, so when a zero-length {{Text}} object was submitted, it still read the entire internal byte array, ignoring the zero size, and produced duplicate rows where there should have been no text. https://github.com/apache/hive/blob/ae008b79b5d52ed6a38875b73025a505725828eb/hcatalog/core/src/main/java/org/apache/hive/hcatalog/data/JsonSerDe.java#L168 > JsonSerDe cannot handle json file with empty lines > -- > > Key: HIVE-15475 > URL: https://issues.apache.org/jira/browse/HIVE-15475 > Project: Hive > Issue Type: Bug > Components: HCatalog >Affects Versions: 1.2.1 >Reporter: pin_zhang >Priority: Major > > 1. start HiveServer2 in apache-hive-1.2.1 > 2 start a beeline connect to hive server2 > ADD JAR ADD JAR > /home/apache-hive-1.2.1-bin/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.1.jar > ; >CREATE external TABLE my_table(a string, b bigint) > ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' > STORED AS TEXTFILE > location 'file:///home/hive/json'; > 3 put a file with more than one new lines at the end of the file > {"a":"a_1", "b" : 1} > 4 run sql > select * from my_table ; > +-+-+--+ > | my_table.a | my_table.b | > +-+-+--+ > | a_1 | 1 | > | a_1 | 1 | > | a_1 | 1 | > | a_1 | 1 | > | a_1 | 1 | > +-+-+--+ -- This message was sent by Atlassian JIRA (v7.6.3#76005)
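The failure mode described in the comment above can be reproduced in miniature. The sketch below does not use Hadoop: {{ReusableText}} is a hypothetical stand-in that mimics only the two properties mentioned in the comment (a backing byte array that is reused between records, and a separate length), and the decode method names are invented for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a reusable text buffer; this is NOT org.apache.hadoop.io.Text.
final class TextLengthSketch {

  static final class ReusableText {
    private byte[] bytes = new byte[0];
    private int length;

    // Copies the new value into the existing buffer. The buffer only grows,
    // so bytes from an earlier, longer record remain past 'length'.
    void set(String value) {
      byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
      if (utf8.length > bytes.length) {
        bytes = new byte[utf8.length];
      }
      System.arraycopy(utf8, 0, bytes, 0, utf8.length);
      length = utf8.length;
    }

    byte[] getBytes() { return bytes; }   // whole backing array, may contain stale bytes
    int getLength() { return length; }    // number of valid bytes
  }

  // Buggy pattern: decodes the whole backing array and ignores the length.
  static String decodeIgnoringLength(ReusableText t) {
    return new String(t.getBytes(), StandardCharsets.UTF_8);
  }

  // Correct pattern: decodes only the first getLength() bytes.
  static String decodeHonoringLength(ReusableText t) {
    return new String(t.getBytes(), 0, t.getLength(), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    ReusableText row = new ReusableText();
    row.set("{\"a\":\"a_1\", \"b\" : 1}");   // a real record
    row.set("");                              // a trailing empty line reuses the buffer
    System.out.println("ignoring length: [" + decodeIgnoringLength(row) + "]");
    System.out.println("honoring length: [" + decodeHonoringLength(row) + "]");
  }
}
```

With the length ignored, the empty line decodes to the previous record again, which matches the repeated {{a_1 | 1}} rows in the report; with the length honored, it decodes to an empty string.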
[jira] [Resolved] (HIVE-15475) JsonSerDe cannot handle json file with empty lines
[ https://issues.apache.org/jira/browse/HIVE-15475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR resolved HIVE-15475. Resolution: Fixed > JsonSerDe cannot handle json file with empty lines > -- > > Key: HIVE-15475 > URL: https://issues.apache.org/jira/browse/HIVE-15475 > Project: Hive > Issue Type: Bug > Components: HCatalog >Affects Versions: 1.2.1 >Reporter: pin_zhang >Priority: Major > > 1. start HiveServer2 in apache-hive-1.2.1 > 2 start a beeline connect to hive server2 > ADD JAR ADD JAR > /home/apache-hive-1.2.1-bin/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.1.jar > ; >CREATE external TABLE my_table(a string, b bigint) > ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' > STORED AS TEXTFILE > location 'file:///home/hive/json'; > 3 put a file with more than one new lines at the end of the file > {"a":"a_1", "b" : 1} > 4 run sql > select * from my_table ; > +-+-+--+ > | my_table.a | my_table.b | > +-+-+--+ > | a_1 | 1 | > | a_1 | 1 | > | a_1 | 1 | > | a_1 | 1 | > | a_1 | 1 | > +-+-+--+ -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (HIVE-15475) JsonSerDe cannot handle json file with empty lines
[ https://issues.apache.org/jira/browse/HIVE-15475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775965#comment-16775965 ] BELUGA BEHR edited comment on HIVE-15475 at 2/23/19 6:16 PM: - I've been digging into this as part of HIVE-21240. I'm pretty sure that this is related to [MAPREDUCE-6549], [MAPREDUCE-6481], [MAPREDUCE-6558] which have all been fixed in Hadoop 2.6.3/2.6.5 However, Hive 2.1 uses Hadoop 2.6.1: https://github.com/apache/hive/blob/rel/release-2.1.1/pom.xml#L135 You have to use Hive 2.2.0 or higher: https://github.com/apache/hive/blob/rel/release-2.2.0/pom.xml#L141 was (Author: belugabehr): I've been digging into this as part of HIVE-21240. I'm pretty sure that this is related to [MAPREDUCE-6549], [MAPREDUCE-6481], [MAPREDUCE-6558] which have all been fixed in Hadoop 2.6.3/2.6.5 However, Hive 2.1 uses Hadoop 2.6.1: https://github.com/apache/hive/blob/rel/release-2.1.1/pom.xml#L135 You have to use Hive 2.2.1 or higher: https://github.com/apache/hive/blob/rel/release-2.2.0/pom.xml#L141 > JsonSerDe cannot handle json file with empty lines > -- > > Key: HIVE-15475 > URL: https://issues.apache.org/jira/browse/HIVE-15475 > Project: Hive > Issue Type: Bug > Components: HCatalog >Affects Versions: 1.2.1 >Reporter: pin_zhang >Priority: Major > > 1. start HiveServer2 in apache-hive-1.2.1 > 2 start a beeline connect to hive server2 > ADD JAR ADD JAR > /home/apache-hive-1.2.1-bin/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.1.jar > ; >CREATE external TABLE my_table(a string, b bigint) > ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' > STORED AS TEXTFILE > location 'file:///home/hive/json'; > 3 put a file with more than one new lines at the end of the file > {"a":"a_1", "b" : 1} > 4 run sql > select * from my_table ; > +-+-+--+ > | my_table.a | my_table.b | > +-+-+--+ > | a_1 | 1 | > | a_1 | 1 | > | a_1 | 1 | > | a_1 | 1 | > | a_1 | 1 | > +-+-+--+ -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-15475) JsonSerDe cannot handle json file with empty lines
[ https://issues.apache.org/jira/browse/HIVE-15475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775965#comment-16775965 ] BELUGA BEHR commented on HIVE-15475: I've been digging into this as part of HIVE-21240. I'm pretty sure that this is related to [MAPREDUCE-6549], [MAPREDUCE-6481], [MAPREDUCE-6558] which have all been fixed in Hadoop 2.6.3/2.6.5 However, Hive 2.1 uses Hadoop 2.6.1: https://github.com/apache/hive/blob/rel/release-2.1.1/pom.xml#L135 You have to use Hive 2.2.1 or higher: https://github.com/apache/hive/blob/rel/release-2.2.0/pom.xml#L141 > JsonSerDe cannot handle json file with empty lines > -- > > Key: HIVE-15475 > URL: https://issues.apache.org/jira/browse/HIVE-15475 > Project: Hive > Issue Type: Bug > Components: HCatalog >Affects Versions: 1.2.1 >Reporter: pin_zhang >Priority: Major > > 1. start HiveServer2 in apache-hive-1.2.1 > 2 start a beeline connect to hive server2 > ADD JAR ADD JAR > /home/apache-hive-1.2.1-bin/hcatalog/share/hcatalog/hive-hcatalog-core-1.2.1.jar > ; >CREATE external TABLE my_table(a string, b bigint) > ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' > STORED AS TEXTFILE > location 'file:///home/hive/json'; > 3 put a file with more than one new lines at the end of the file > {"a":"a_1", "b" : 1} > 4 run sql > select * from my_table ; > +-+-+--+ > | my_table.a | my_table.b | > +-+-+--+ > | a_1 | 1 | > | a_1 | 1 | > | a_1 | 1 | > | a_1 | 1 | > | a_1 | 1 | > +-+-+--+ -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-21246) Un-bury DelimitedJSONSerDe from PlanUtils.java
[ https://issues.apache.org/jira/browse/HIVE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775955#comment-16775955 ] BELUGA BEHR commented on HIVE-21246: [~ngangam] [~pvary] :) > Un-bury DelimitedJSONSerDe from PlanUtils.java > -- > > Key: HIVE-21246 > URL: https://issues.apache.org/jira/browse/HIVE-21246 > Project: Hive > Issue Type: Improvement >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Minor > Attachments: HIVE-21246.1.patch, HIVE-21246.1.patch, > HIVE-21246.1.patch > > > Ultimately, I'd like to get rid of > {{org.apache.hadoop.hive.serde2.DelimitedJSONSerDe}}, but for now I'm trying to > make it easier to get rid of later. It's currently buried in > {{PlanUtils.java}}. > A SerDe and a boolean flag get passed into these methods. If the flag is > set to true, the specified SerDe is overwritten and replaced with > {{DelimitedJSONSerDe}}. This is not documented anywhere, and it's a weird > thing to do. Just pass in the required SerDe from the start instead of > sending the wrong SerDe and a flag to overwrite it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
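The pattern criticized in the description above, and the suggested alternative, can be sketched roughly as follows. None of this is the actual {{PlanUtils}} code: the method names and the nested stand-in classes are hypothetical and exist only so that the example compiles on its own.

```java
import java.util.Objects;

// Hypothetical sketch of the two call styles; not the real PlanUtils API.
final class SerDeSelectionSketch {
  interface SerDe {}
  static final class LazySimpleSerDe implements SerDe {}     // stand-in class
  static final class DelimitedJSONSerDe implements SerDe {}  // stand-in class

  // Style described in the issue: the caller names one SerDe, and a boolean
  // flag silently replaces it with another one.
  static Class<? extends SerDe> resolveWithFlag(
      Class<? extends SerDe> requested, boolean useDelimitedJson) {
    return useDelimitedJson ? DelimitedJSONSerDe.class : requested;
  }

  // Style suggested in the issue: the caller passes the SerDe it actually
  // wants, and nothing is overwritten behind its back.
  static Class<? extends SerDe> resolve(Class<? extends SerDe> requested) {
    return Objects.requireNonNull(requested, "requested SerDe");
  }

  public static void main(String[] args) {
    System.out.println(resolveWithFlag(LazySimpleSerDe.class, true).getSimpleName());
    System.out.println(resolve(DelimitedJSONSerDe.class).getSimpleName());
  }
}
```

In the second style the choice of SerDe is visible at the call site, which is what would make a later removal of {{DelimitedJSONSerDe}} easier to carry out.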
[jira] [Updated] (HIVE-21303) Update TextRecordReader
[ https://issues.apache.org/jira/browse/HIVE-21303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21303: --- Attachment: (was: HIVE-21303.1.patch) > Update TextRecordReader > --- > > Key: HIVE-21303 > URL: https://issues.apache.org/jira/browse/HIVE-21303 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Affects Versions: 4.0.0, 3.2.0 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Attachments: HIVE-21303.1.patch > > > Remove use of Deprecated > {{org.apache.hadoop.mapred.LineRecordReader.LineReader}} > For every call to {{next}}, the code dives into the configuration map to see > if this feature is enabled. Just look it up once and cache the value. > {code:java} > public int next(Writable row) throws IOException { > ... > if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVESCRIPTESCAPE)) { > return HiveUtils.unescapeText((Text) row); > } > return bytesConsumed; > } > {code} > Other clean up. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
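The "look it up once and cache the value" suggestion in the description above can be sketched without any Hive classes. {{CountingConf}} below is a hypothetical stand-in for a configuration object that counts its lookups, the key name is invented, and the two reader classes only model the flag check, not real record reading.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; CountingConf is NOT HiveConf and the key name is made up.
final class CachedFlagReaderSketch {
  static final String ESCAPE_KEY = "example.script.escape";

  static final class CountingConf {
    private final Map<String, String> values = new HashMap<>();
    int lookups;  // how many times the configuration map was consulted

    void set(String key, String value) { values.put(key, value); }

    boolean getBoolean(String key, boolean defaultValue) {
      lookups++;
      String v = values.get(key);
      return v == null ? defaultValue : Boolean.parseBoolean(v);
    }
  }

  // Before: every call to next() goes back to the configuration.
  static final class PerCallReader {
    private final CountingConf conf;
    PerCallReader(CountingConf conf) { this.conf = conf; }
    boolean next() { return conf.getBoolean(ESCAPE_KEY, false); }
  }

  // After: the flag is read once at construction time and kept in a final field.
  static final class CachingReader {
    private final boolean escape;
    CachingReader(CountingConf conf) { this.escape = conf.getBoolean(ESCAPE_KEY, false); }
    boolean next() { return escape; }
  }

  public static void main(String[] args) {
    CountingConf conf = new CountingConf();
    conf.set(ESCAPE_KEY, "true");
    PerCallReader slow = new PerCallReader(conf);
    for (int i = 0; i < 1000; i++) { slow.next(); }
    System.out.println("per-call lookups: " + conf.lookups);

    conf.lookups = 0;
    CachingReader fast = new CachingReader(conf);
    for (int i = 0; i < 1000; i++) { fast.next(); }
    System.out.println("cached lookups: " + conf.lookups);
  }
}
```

The trade-off is that a cached flag no longer reflects configuration changes made after the reader is created, which is presumably acceptable for a per-task record reader.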
[jira] [Commented] (HIVE-21303) Update TextRecordReader
[ https://issues.apache.org/jira/browse/HIVE-21303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775944#comment-16775944 ] BELUGA BEHR commented on HIVE-21303: [~pvary] [~ngangam] :) > Update TextRecordReader > --- > > Key: HIVE-21303 > URL: https://issues.apache.org/jira/browse/HIVE-21303 > Project: Hive > Issue Type: Improvement > Components: Query Processor >Affects Versions: 4.0.0, 3.2.0 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Attachments: HIVE-21303.1.patch > > > Remove use of Deprecated > {{org.apache.hadoop.mapred.LineRecordReader.LineReader}} > For every call to {{next}}, the code dives into the configuration map to see > if this feature is enabled. Just look it up once and cache the value. > {code:java} > public int next(Writable row) throws IOException { > ... > if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVESCRIPTESCAPE)) { > return HiveUtils.unescapeText((Text) row); > } > return bytesConsumed; > } > {code} > Other clean up. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21240: --- Attachment: HIVE-21240.9.patch > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, > HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, > HIVE-21240.9.patch, HIVE-24240.8.patch > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21240: --- Status: Patch Available (was: Open) Added a patch to fix the JSON writer when using derived column names (_c0, _c1, etc.). The Kafka_Handler Q-Test fails locally on trunk as well, so please ignore that UT failure. If Jenkins comes back clean, please consider accepting [^HIVE-21240.9.patch] for inclusion into the project. Reads with this SerDe are a bit quicker and writes are a bit slower. I'm not exactly sure what makes the reads faster, but the slower writes are expected: the new writer more fully utilizes the Jackson library, whereas the current implementation uses its own very lightweight writing mechanism. > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 3.1.1, 4.0.0 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, > HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, > HIVE-21240.9.patch, HIVE-24240.8.patch > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21240: --- Attachment: (was: HIVE-24240.8.patch) > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, > HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-24240.8.patch > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21240: --- Status: Open (was: Patch Available) > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 3.1.1, 4.0.0 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, > HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, > HIVE-21240.8.patch, HIVE-21240.8.patch, HIVE-24240.8.patch, > HIVE-24240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21240: --- Attachment: (was: HIVE-21240.8.patch) > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, > HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-24240.8.patch > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Issue Comment Deleted] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21240: --- Comment: was deleted (was: Though, I am getting a failure in some scenarios that are not picked up in the UTs. I need to investigate them further.) > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, > HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, > HIVE-21240.8.patch, HIVE-21240.8.patch, HIVE-24240.8.patch, > HIVE-24240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Issue Comment Deleted] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BELUGA BEHR updated HIVE-21240: --- Comment: was deleted (was: OK, I figured out the issue. I am running this SerDe in CDH 6.1 (based on Hive 2.2) and it fails with a version-mismatch issue when handling dates. This patch contains a JsonSerDe which is faster (read) and more feature rich than the existing JsonSerde. Please accept the latest patch for inclusion into the project.) > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, > HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, > HIVE-21240.8.patch, HIVE-21240.8.patch, HIVE-24240.8.patch, > HIVE-24240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775543#comment-16775543 ] BELUGA BEHR edited comment on HIVE-21240 at 2/22/19 7:41 PM: - OK, I figured out the issue. I am running this SerDe in CDH 6.1 (based on Hive 2.2) and it fails with a version-mismatch issue when handling dates. This patch contains a JsonSerDe which is faster (read) and more feature rich than the existing JsonSerde. Please accept the latest patch for inclusion into the project. was (Author: belugabehr): OK, I figured out the issue. I am running this SerDe in CDH 6.1 and it fails with a version-mismatch issue when handling dates. This patch contains a JsonSerDe which is faster (read) and more feature rich than the existing JsonSerde. Please accept the latest patch for inclusion into the project. > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, > HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, > HIVE-21240.8.patch, HIVE-21240.8.patch, HIVE-24240.8.patch, > HIVE-24240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775543#comment-16775543 ] BELUGA BEHR commented on HIVE-21240: OK, I figured out the issue. I am running this SerDe in CDH 6.1 and it fails with a version-mismatch issue when handling dates. This patch contains a JsonSerDe which is faster (read) and more feature rich than the existing JsonSerde. Please accept the latest patch for inclusion into the project. > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, > HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, > HIVE-21240.8.patch, HIVE-21240.8.patch, HIVE-24240.8.patch, > HIVE-24240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775343#comment-16775343 ] BELUGA BEHR commented on HIVE-21240: Though, I am getting a failure in some scenarios that are not picked up in the UTs. I need to investigate them further. > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, > HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, > HIVE-21240.8.patch, HIVE-21240.8.patch, HIVE-24240.8.patch, > HIVE-24240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775331#comment-16775331 ] BELUGA BEHR commented on HIVE-21240: Read Performance 195 million JSON records (String, int, float, Date) JSON-Trunk: 160s JSON-21240: 147s > JSON SerDe Re-Write > --- > > Key: HIVE-21240 > URL: https://issues.apache.org/jira/browse/HIVE-21240 > Project: Hive > Issue Type: Improvement > Components: Serializers/Deserializers >Affects Versions: 4.0.0, 3.1.1 >Reporter: BELUGA BEHR >Assignee: BELUGA BEHR >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, > HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch, > HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch, > HIVE-21240.8.patch, HIVE-21240.8.patch, HIVE-24240.8.patch, > HIVE-24240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch > > Time Spent: 10m > Remaining Estimate: 0h > > The JSON SerDe has a few issues, I will link them to this JIRA. > * Use Jackson Tree parser instead of manually parsing > * Added support for base-64 encoded data (the expected format when using JSON) > * Added support to skip blank lines (returns all columns as null values) > * Current JSON parser accepts, but does not apply, custom timestamp formats > in most cases > * Added some unit tests > * Added cache for column-name to column-index searches, currently O\(n\) for > each row processed, for each column in the row -- This message was sent by Atlassian JIRA (v7.6.3#76005)
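To put the two read timings above in relative terms, the arithmetic can be worked through as below. The helper names are invented for illustration, the record count is taken as exactly 195,000,000 (the comment only says "195 million"), and the figures describe the single run reported in the comment, not a general performance claim.

```java
// Hypothetical helper for the arithmetic only; it does not run any benchmark.
final class ReadThroughputSketch {

  // Records processed per second of wall-clock time.
  static double recordsPerSecond(long records, double seconds) {
    return records / seconds;
  }

  // Percentage reduction in elapsed time relative to the baseline.
  static double percentLessTime(double baselineSeconds, double candidateSeconds) {
    return (baselineSeconds - candidateSeconds) / baselineSeconds * 100.0;
  }

  public static void main(String[] args) {
    long records = 195_000_000L;   // assumed exact value for "195 million"
    double trunkSeconds = 160.0;   // JSON-Trunk
    double patchSeconds = 147.0;   // JSON-21240

    System.out.printf("trunk: %.0f records/s%n", recordsPerSecond(records, trunkSeconds));
    System.out.printf("patch: %.0f records/s%n", recordsPerSecond(records, patchSeconds));
    System.out.printf("time saved: %.1f%%%n", percentLessTime(trunkSeconds, patchSeconds));
  }
}
```

The saving works out to 13 seconds out of 160, or roughly 8% less elapsed time for the read.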
[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774730#comment-16774730 ]

BELUGA BEHR commented on HIVE-21240:

OK. My branch was a bit behind the trunk. Kafka Handler is using JSON SerDe so I will need to look more closely to see if these unit test failures are related.

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
> Issue Type: Improvement
> Components: Serializers/Deserializers
> Affects Versions: 4.0.0, 3.1.1
> Reporter: BELUGA BEHR
> Assignee: BELUGA BEHR
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch,
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch,
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch,
> HIVE-21240.8.patch, HIVE-21240.8.patch, HIVE-24240.8.patch,
> HIVE-24240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for
> each row processed, for each column in the row

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21303) Update TextRecordReader
[ https://issues.apache.org/jira/browse/HIVE-21303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

BELUGA BEHR updated HIVE-21303:
---
Status: Open (was: Patch Available)

> Update TextRecordReader
> ---
>
> Key: HIVE-21303
> URL: https://issues.apache.org/jira/browse/HIVE-21303
> Project: Hive
> Issue Type: Improvement
> Components: Query Processor
> Affects Versions: 4.0.0, 3.2.0
> Reporter: BELUGA BEHR
> Assignee: BELUGA BEHR
> Priority: Major
> Attachments: HIVE-21303.1.patch, HIVE-21303.1.patch
>
>
> Remove use of Deprecated
> {{org.apache.hadoop.mapred.LineRecordReader.LineReader}}
> For every call to {{next}}, the code dives into the configuration map to see
> if this feature is enabled. Just look it up once and cache the value.
> {code:java}
> public int next(Writable row) throws IOException {
>   ...
>   if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVESCRIPTESCAPE)) {
>     return HiveUtils.unescapeText((Text) row);
>   }
>   return bytesConsumed;
> }
> {code}
> Other clean up.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
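The "look it up once and cache the value" suggestion in the quoted description could look something like the sketch below. It is a self-contained stand-in, not Hive's `TextRecordReader`: the configuration is modelled as a plain `Map`, and `CachingLineReader`, `ESCAPE_KEY`, and the toy `unescape` method are all hypothetical names introduced for illustration.

```java
import java.util.Map;

/** Hypothetical stand-in for a record reader; not Hive's TextRecordReader. */
final class CachingLineReader {
  /** Illustrative key only; not claimed to be the real Hive property name. */
  static final String ESCAPE_KEY = "example.script.escape";

  // Read from the configuration once, in the constructor, instead of on every next().
  private final boolean escapeEnabled;

  CachingLineReader(Map<String, String> conf) {
    this.escapeEnabled = Boolean.parseBoolean(conf.getOrDefault(ESCAPE_KEY, "false"));
  }

  /** Copies the line into row, unescaping if the cached flag is set; returns the row length. */
  int next(StringBuilder row, String line) {
    row.setLength(0);
    row.append(escapeEnabled ? unescape(line) : line);
    return row.length();
  }

  /** Toy unescape: turns the two-character sequence backslash + 't' into a tab. */
  static String unescape(String s) {
    return s.replace("\\t", "\t");
  }
}
```

One consequence of this design is that a change to the configuration after the reader has been constructed is no longer observed, which seems acceptable for a setting that is fixed for the lifetime of a reader.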
[jira] [Updated] (HIVE-21303) Update TextRecordReader
[ https://issues.apache.org/jira/browse/HIVE-21303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

BELUGA BEHR updated HIVE-21303:
---
Attachment: HIVE-21303.1.patch

> Update TextRecordReader
> ---
>
> Key: HIVE-21303
> URL: https://issues.apache.org/jira/browse/HIVE-21303
> Project: Hive
> Issue Type: Improvement
> Components: Query Processor
> Affects Versions: 4.0.0, 3.2.0
> Reporter: BELUGA BEHR
> Assignee: BELUGA BEHR
> Priority: Major
> Attachments: HIVE-21303.1.patch, HIVE-21303.1.patch
>
>
> Remove use of Deprecated
> {{org.apache.hadoop.mapred.LineRecordReader.LineReader}}
> For every call to {{next}}, the code dives into the configuration map to see
> if this feature is enabled. Just look it up once and cache the value.
> {code:java}
> public int next(Writable row) throws IOException {
>   ...
>   if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVESCRIPTESCAPE)) {
>     return HiveUtils.unescapeText((Text) row);
>   }
>   return bytesConsumed;
> }
> {code}
> Other clean up.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21303) Update TextRecordReader
[ https://issues.apache.org/jira/browse/HIVE-21303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

BELUGA BEHR updated HIVE-21303:
---
Status: Patch Available (was: Open)

> Update TextRecordReader
> ---
>
> Key: HIVE-21303
> URL: https://issues.apache.org/jira/browse/HIVE-21303
> Project: Hive
> Issue Type: Improvement
> Components: Query Processor
> Affects Versions: 4.0.0, 3.2.0
> Reporter: BELUGA BEHR
> Assignee: BELUGA BEHR
> Priority: Major
> Attachments: HIVE-21303.1.patch, HIVE-21303.1.patch
>
>
> Remove use of Deprecated
> {{org.apache.hadoop.mapred.LineRecordReader.LineReader}}
> For every call to {{next}}, the code dives into the configuration map to see
> if this feature is enabled. Just look it up once and cache the value.
> {code:java}
> public int next(Writable row) throws IOException {
>   ...
>   if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVESCRIPTESCAPE)) {
>     return HiveUtils.unescapeText((Text) row);
>   }
>   return bytesConsumed;
> }
> {code}
> Other clean up.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21303) Update TextRecordReader
[ https://issues.apache.org/jira/browse/HIVE-21303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

BELUGA BEHR updated HIVE-21303:
---
Attachment: HIVE-21303.1.patch

> Update TextRecordReader
> ---
>
> Key: HIVE-21303
> URL: https://issues.apache.org/jira/browse/HIVE-21303
> Project: Hive
> Issue Type: Improvement
> Components: Query Processor
> Affects Versions: 4.0.0, 3.2.0
> Reporter: BELUGA BEHR
> Assignee: BELUGA BEHR
> Priority: Major
> Attachments: HIVE-21303.1.patch
>
>
> Remove use of Deprecated
> {{org.apache.hadoop.mapred.LineRecordReader.LineReader}}
> For every call to {{next}}, the code dives into the configuration map to see
> if this feature is enabled. Just look it up once and cache the value.
> {code:java}
> public int next(Writable row) throws IOException {
>   ...
>   if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVESCRIPTESCAPE)) {
>     return HiveUtils.unescapeText((Text) row);
>   }
>   return bytesConsumed;
> }
> {code}
> Other clean up.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (HIVE-21303) Update TextRecordReader
[ https://issues.apache.org/jira/browse/HIVE-21303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

BELUGA BEHR reassigned HIVE-21303:
--

> Update TextRecordReader
> ---
>
> Key: HIVE-21303
> URL: https://issues.apache.org/jira/browse/HIVE-21303
> Project: Hive
> Issue Type: Improvement
> Components: Query Processor
> Affects Versions: 4.0.0, 3.2.0
> Reporter: BELUGA BEHR
> Assignee: BELUGA BEHR
> Priority: Major
> Attachments: HIVE-21303.1.patch
>
>
> Remove use of Deprecated
> {{org.apache.hadoop.mapred.LineRecordReader.LineReader}}
> For every call to {{next}}, the code dives into the configuration map to see
> if this feature is enabled. Just look it up once and cache the value.
> {code:java}
> public int next(Writable row) throws IOException {
>   ...
>   if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVESCRIPTESCAPE)) {
>     return HiveUtils.unescapeText((Text) row);
>   }
>   return bytesConsumed;
> }
> {code}
> Other clean up.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21303) Update TextRecordReader
[ https://issues.apache.org/jira/browse/HIVE-21303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

BELUGA BEHR updated HIVE-21303:
---
Status: Patch Available (was: Open)

> Update TextRecordReader
> ---
>
> Key: HIVE-21303
> URL: https://issues.apache.org/jira/browse/HIVE-21303
> Project: Hive
> Issue Type: Improvement
> Components: Query Processor
> Affects Versions: 4.0.0, 3.2.0
> Reporter: BELUGA BEHR
> Assignee: BELUGA BEHR
> Priority: Major
> Attachments: HIVE-21303.1.patch
>
>
> Remove use of Deprecated
> {{org.apache.hadoop.mapred.LineRecordReader.LineReader}}
> For every call to {{next}}, the code dives into the configuration map to see
> if this feature is enabled. Just look it up once and cache the value.
> {code:java}
> public int next(Writable row) throws IOException {
>   ...
>   if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVESCRIPTESCAPE)) {
>     return HiveUtils.unescapeText((Text) row);
>   }
>   return bytesConsumed;
> }
> {code}
> Other clean up.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-21275) Lower Logging Level in Operator Class
[ https://issues.apache.org/jira/browse/HIVE-21275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774110#comment-16774110 ]

BELUGA BEHR commented on HIVE-21275:

[~ngangam] [~pvary] Please review :)

> Lower Logging Level in Operator Class
> -
>
> Key: HIVE-21275
> URL: https://issues.apache.org/jira/browse/HIVE-21275
> Project: Hive
> Issue Type: Improvement
> Affects Versions: 4.0.0, 3.2.0
> Reporter: BELUGA BEHR
> Assignee: BELUGA BEHR
> Priority: Minor
> Fix For: 4.0.0, 3.2.0
>
> Attachments: HIVE-21275.1.patch
>
>
> There is an incredible amount of logging generated by the {{Operator}} during
> the Q-Tests.
> I counted more than 1 *million* lines of pretty useless logging. Please
> lower to TRACE level.
> {code}
> 2019-02-14T14:25:31,612 DEBUG [pool-69-thread-1] exec.JoinOperator: Starting group
> 2019-02-14T14:25:31,612 DEBUG [pool-69-thread-1] exec.FileSinkOperator: Starting group
> 2019-02-14T14:25:31,612 DEBUG [pool-69-thread-1] exec.JoinOperator: Starting group
> 2019-02-14T14:25:31,612 DEBUG [pool-69-thread-1] exec.FileSinkOperator: Starting group
> 2019-02-14T14:25:31,612 DEBUG [pool-69-thread-1] exec.JoinOperator: Starting group
> 2019-02-14T14:25:31,612 DEBUG [pool-69-thread-1] exec.FileSinkOperator: Starting group
> 2019-02-14T14:25:31,612 DEBUG [pool-69-thread-1] exec.JoinOperator: Starting group
> 2019-02-14T14:25:31,612 DEBUG [pool-69-thread-1] exec.FileSinkOperator: Starting group
> {code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-21264) Improvements Around CharTypeInfo
[ https://issues.apache.org/jira/browse/HIVE-21264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774111#comment-16774111 ]

BELUGA BEHR commented on HIVE-21264:

[~ngangam] [~pvary] Please review :)

> Improvements Around CharTypeInfo
> 
>
> Key: HIVE-21264
> URL: https://issues.apache.org/jira/browse/HIVE-21264
> Project: Hive
> Issue Type: Improvement
> Affects Versions: 4.0.0, 3.2.0
> Reporter: BELUGA BEHR
> Assignee: BELUGA BEHR
> Priority: Minor
> Attachments: HIVE-21264.1.patch, HIVE-21264.2.patch
>
>
> The {{CharTypeInfo}} stores the type name of the data type (char/varchar) and
> the length (1-255). {{CharTypeInfo}} objects are often getting cached once
> they are created.
> The {{hashcode()}} and {{equals()}} of its sub-classes varchar and char are
> inconsistent.
> * Make hashcode and equals consistent (and fast)
> * Simplify the {{getQualifiedName}} implementation and reduce the scope to
> protected
> * Other related nits

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
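To illustrate what "consistent" means for the first bullet of the quoted description, here is a minimal sketch in which `equals()` and `hashCode()` are derived from exactly the same two fields, so equal objects always hash alike. `LengthTypeInfo` is a hypothetical, simplified class (a single final class rather than Hive's `CharTypeInfo` hierarchy) and is not the code from the attached patches.

```java
import java.util.Objects;

/** Hypothetical simplified type descriptor; not Hive's CharTypeInfo. */
final class LengthTypeInfo {
  private final String typeName; // e.g. "char" or "varchar"
  private final int length;

  LengthTypeInfo(String typeName, int length) {
    this.typeName = Objects.requireNonNull(typeName, "typeName");
    this.length = length;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof LengthTypeInfo)) {
      return false;
    }
    LengthTypeInfo other = (LengthTypeInfo) o;
    // Compare the cheap int first, then the type name.
    return length == other.length && typeName.equals(other.typeName);
  }

  @Override
  public int hashCode() {
    // Uses the same two fields as equals(), with no intermediate string building.
    return 31 * typeName.hashCode() + length;
  }

  /** Builds a name such as "varchar(10)". */
  String getQualifiedName() {
    return typeName + "(" + length + ")";
  }
}
```

Because such objects are cached and used as lookup keys, a mismatch between the two methods would let equal type descriptors land in different hash buckets, so keeping them in step matters more than usual here.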
[jira] [Updated] (HIVE-21289) Expect EQ and LIKE to Generate the Identical Explain Plans
[ https://issues.apache.org/jira/browse/HIVE-21289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

BELUGA BEHR updated HIVE-21289:
---
Description:
I generated some test data with the UUID function.
{code:sql}
explain select * from test_like where a like 'abce6254-d437-426b-8873-2cbc153ddfbc';
explain select * from test_like where a = 'abce6254-d437-426b-8873-2cbc153ddfbc';
{code}
{code}
Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: test_like
            filterExpr: (a like 'abce6254-d437-426b-8873-2cbc153ddfbc') (type: boolean)
            Statistics: Num rows: 262144 Data size: 9437184 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (a like 'abce6254-d437-426b-8873-2cbc153ddfbc') (type: boolean)
              Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: a (type: string)
                outputColumnNames: _col0
                Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code}
{code}
Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: test_like
            filterExpr: (a = 'abce6254-d437-426b-8873-2cbc153ddfbc') (type: boolean)
            Statistics: Num rows: 262144 Data size: 9437184 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (a = 'abce6254-d437-426b-8873-2cbc153ddfbc') (type: boolean)
              Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: 'abce6254-d437-426b-8873-2cbc153ddfbc' (type: string)
                outputColumnNames: _col0
                Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code}
They may be the same under the covers, but I would expect the EXPLAIN plan to be exactly the same.

was:
I generated some test data with the UUID function.
{code:sql}
explain select * from test_like where a like 'abce6254-d437-426b-8873-2cbc153ddfbc';
explain select * from test_like where a = 'abce6254-d437-426b-8873-2cbc153ddfbc';
{code}
{code|title=LIKE}
Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: test_like
            filterExpr: (a like 'abce6254-d437-426b-8873-2cbc153ddfbc') (type: boolean)
            Statistics: Num rows: 262144 Data size: 9437184 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (a like 'abce6254-d437-426b-8873-2cbc153ddfbc') (type: boolean)
              Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: a (type: string)
                outputColumnNames: _col0
                Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 131072 Data size: 4718592 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage:
[jira] [Commented] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771441#comment-16771441 ]

BELUGA BEHR commented on HIVE-21240:

I do not believe this failed unit test is related. Please consider the latest patch for inclusion into the project. [^HIVE-24240.8.patch]

{code:java}
2019-02-18T14:55:57,783 DEBUG [pool-17-thread-1] clients.NetworkClient: [Consumer clientId=958935173, groupId=] Initiating connection to node localhost:9093 (id: -1 rack: null)
2019-02-18T14:55:57,785 DEBUG [pool-17-thread-1] network.Selector: [Consumer clientId=958935173, groupId=] Connection with localhost/127.0.0.1 disconnected
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_191]
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:1.8.0_191]
    at org.apache.kafka.common.network.PlaintextTransportLayer.finishConnect(PlaintextTransportLayer.java:50) ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.kafka.common.network.KafkaChannel.finishConnect(KafkaChannel.java:152) ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:471) ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.kafka.common.network.Selector.poll(Selector.java:425) ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510) ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:271) ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:242) ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:218) ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.kafka.clients.consumer.internals.Fetcher.getTopicMetadata(Fetcher.java:274) ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1774) ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1742) ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.hadoop.hive.kafka.KafkaInputFormat.fetchTopicPartitions(KafkaInputFormat.java:189) ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.hadoop.hive.kafka.KafkaInputFormat.lambda$buildFullScanFromKafka$0(KafkaInputFormat.java:96) ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.hadoop.hive.kafka.RetryUtils.retry(RetryUtils.java:93) ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.hadoop.hive.kafka.RetryUtils.retry(RetryUtils.java:116) ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.hadoop.hive.kafka.RetryUtils.retry(RetryUtils.java:109) ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.hadoop.hive.kafka.KafkaInputFormat.buildFullScanFromKafka(KafkaInputFormat.java:98) ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.hadoop.hive.kafka.KafkaInputFormat.lambda$computeSplits$5(KafkaInputFormat.java:135) ~[kafka-handler-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_191]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
2019-02-18T14:55:57,787 DEBUG [pool-17-thread-1] clients.NetworkClient: [Consumer clientId=958935173, groupId=] Node -1 disconnected.
2019-02-18T14:55:57,787 WARN [pool-17-thread-1] clients.NetworkClient: [Consumer clientId=958935173, groupId=] Connection to node -1 could not be established. Broker may not be available.
2019-02-18T14:55:57,787 DEBUG [pool-17-thread-1] internals.ConsumerNetworkClient: [Consumer clientId=958935173, groupId=] Cancelled request with header RequestHeader(apiKey=METADATA, apiVersion=6, clientId=958935173, correlationId=32) due to node -1 being disconnected
2019-02-18T14:55:57,888 DEBUG [pool-17-thread-1] clients.NetworkClient: [Consumer clientId=958935173, groupId=] Give up sending metadata request since no node is available
2019-02-18T14:55:57,990 DEBUG [pool-17-thread-1] clients.NetworkClient: [Consumer clientId=958935173, groupId=] Give up sending metadata request since no node is available
2019-02-18T14:55:58,056 DEBUG
[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

BELUGA BEHR updated HIVE-21240:
---
Status: Patch Available (was: Open)

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
> Issue Type: Improvement
> Components: Serializers/Deserializers
> Affects Versions: 3.1.1, 4.0.0
> Reporter: BELUGA BEHR
> Assignee: BELUGA BEHR
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch,
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch,
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch,
> HIVE-21240.8.patch, HIVE-21240.8.patch, HIVE-24240.8.patch,
> HIVE-24240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for
> each row processed, for each column in the row

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

BELUGA BEHR updated HIVE-21240:
---
Status: Open (was: Patch Available)

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
> Issue Type: Improvement
> Components: Serializers/Deserializers
> Affects Versions: 3.1.1, 4.0.0
> Reporter: BELUGA BEHR
> Assignee: BELUGA BEHR
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch,
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch,
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch,
> HIVE-21240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch,
> HIVE-24240.8.patch, HIVE-24240.8.patch
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for
> each row processed, for each column in the row

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

BELUGA BEHR updated HIVE-21240:
---
Attachment: HIVE-21240.8.patch

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
> Issue Type: Improvement
> Components: Serializers/Deserializers
> Affects Versions: 4.0.0, 3.1.1
> Reporter: BELUGA BEHR
> Assignee: BELUGA BEHR
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch,
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch,
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch,
> HIVE-21240.8.patch, HIVE-21240.8.patch, HIVE-24240.8.patch,
> HIVE-24240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for
> each row processed, for each column in the row

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

BELUGA BEHR updated HIVE-21240:
---
Status: Patch Available (was: Open)

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
> Issue Type: Improvement
> Components: Serializers/Deserializers
> Affects Versions: 3.1.1, 4.0.0
> Reporter: BELUGA BEHR
> Assignee: BELUGA BEHR
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch,
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch,
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch,
> HIVE-21240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch,
> HIVE-24240.8.patch, HIVE-24240.8.patch
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for
> each row processed, for each column in the row

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

BELUGA BEHR updated HIVE-21240:
---
Attachment: HIVE-21240.8.patch

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
> Issue Type: Improvement
> Components: Serializers/Deserializers
> Affects Versions: 4.0.0, 3.1.1
> Reporter: BELUGA BEHR
> Assignee: BELUGA BEHR
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch,
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch,
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch,
> HIVE-21240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch,
> HIVE-24240.8.patch, HIVE-24240.8.patch
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for
> each row processed, for each column in the row

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-21240) JSON SerDe Re-Write
[ https://issues.apache.org/jira/browse/HIVE-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

BELUGA BEHR updated HIVE-21240:
---
Status: Open (was: Patch Available)

> JSON SerDe Re-Write
> ---
>
> Key: HIVE-21240
> URL: https://issues.apache.org/jira/browse/HIVE-21240
> Project: Hive
> Issue Type: Improvement
> Components: Serializers/Deserializers
> Affects Versions: 3.1.1, 4.0.0
> Reporter: BELUGA BEHR
> Assignee: BELUGA BEHR
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch,
> HIVE-21240.2.patch, HIVE-21240.3.patch, HIVE-21240.4.patch,
> HIVE-21240.5.patch, HIVE-21240.6.patch, HIVE-21240.7.patch,
> HIVE-24240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch, HIVE-24240.8.patch
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for
> each row processed, for each column in the row

--
This message was sent by Atlassian JIRA (v7.6.3#76005)