Mithun Radhakrishnan created HIVE-14789:
-------------------------------------------
             Summary: Avro Table-reads bork when using SerDe-generated table-schema.
                 Key: HIVE-14789
                 URL: https://issues.apache.org/jira/browse/HIVE-14789
             Project: Hive
          Issue Type: Bug
          Components: Serializers/Deserializers
    Affects Versions: 2.0.1, 1.2.1
            Reporter: Mithun Radhakrishnan

AvroSerDe allows one to skip the table-columns in a table definition when creating a table, as long as the TBLPROPERTIES includes a valid {{avro.schema.url}} or {{avro.schema.literal}}; the table columns are then inferred from the Avro schema file/literal. The problem is that the inferred schema might not be congruent with the actual schema in the Avro schema file/literal. Consider the following table definition:

{code:sql}
CREATE TABLE avro_schema_break_1
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
  "type": "record",
  "name": "Messages",
  "namespace": "net.myth",
  "fields": [
    {
      "name": "header",
      "type": [
        "null",
        {
          "type": "record",
          "name": "HeaderInfo",
          "fields": [
            { "name": "inferred_event_type", "type": [ "null", "string" ], "default": null },
            { "name": "event_type", "type": [ "null", "string" ], "default": null },
            { "name": "event_version", "type": [ "null", "string" ], "default": null }
          ]
        }
      ]
    },
    {
      "name": "messages",
      "type": {
        "type": "array",
        "items": {
          "name": "MessageInfo",
          "type": "record",
          "fields": [
            { "name": "message_id", "type": [ "null", "string" ], "doc": "Message-ID" },
            { "name": "received_date", "type": [ "null", "long" ], "doc": "Received Date" },
            { "name": "sent_date", "type": [ "null", "long" ] },
            { "name": "from_name", "type": [ "null", "string" ] },
            {
              "name": "flags",
              "type": [
                "null",
                {
                  "type": "record",
                  "name": "Flags",
                  "fields": [
                    { "name": "is_seen", "type": [ "null", "boolean" ], "default": null },
                    { "name": "is_read", "type": [ "null", "boolean" ], "default": null },
                    { "name": "is_flagged", "type": [ "null", "boolean" ], "default": null }
                  ]
                }
              ],
              "default": null
            }
          ]
        }
      }
    }
  ]
}');
{code}

This produces a table with the following schema:

{noformat}
2016-09-19T13:23:42,934 DEBUG [0ce7e586-13ea-4390-ac2a-6dac36e8a216 main] hive.log: DDL: struct avro_schema_break_1 {
  struct<inferred_event_type:string,event_type:string,event_version:string> header,
  list<struct<message_id:string,received_date:i64,sent_date:i64,from_name:string,flags:struct<is_seen:bool,is_read:bool,is_flagged:bool>>> messages}
{noformat}

Data written to this table with the Avro schema from {{avro.schema.literal}} (e.g. via Pig's {{AvroStorage}}) cannot then be read back through Hive's generated table schema. This is the exception one sees:

{noformat}
java.io.IOException: org.apache.avro.AvroTypeException: Found net.myth.HeaderInfo, expecting union
	at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:521)
	at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
	at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147)
	at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2019)
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:253)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:400)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
	at org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:1162)
	at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:1136)
	at org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:172)
	at org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:104)
	at org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver(TestCliDriver.java:59)
	...
{noformat}

The only way to read this table is by supplying the original {{avro.schema.literal}} or {{avro.schema.url}}.
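For illustration, the lossy step can be sketched outside Hive. The following Python snippet (a simplified sketch, not Hive's actual AvroSerDe code; {{infer_hive_type}} and the cut-down {{writer_schema}} are hypothetical) mimics how a {{["null", T]}} union is collapsed to just T's Hive type during column inference. Any schema regenerated from the inferred columns therefore describes a plain record where the writer actually wrote a union branch, which is consistent with "Found net.myth.HeaderInfo, expecting union" at read time:

```python
def infer_hive_type(avro_type):
    """Sketch of AvroSerDe-style column inference: a ["null", T] union
    is collapsed to T's Hive type, silently dropping the union."""
    if isinstance(avro_type, list):  # Avro union
        branches = [b for b in avro_type if b != "null"]
        if len(branches) == 1:
            return infer_hive_type(branches[0])
        raise ValueError("multi-branch unions not handled in this sketch")
    if isinstance(avro_type, dict):
        if avro_type["type"] == "record":
            cols = ",".join(
                "%s:%s" % (f["name"], infer_hive_type(f["type"]))
                for f in avro_type["fields"])
            return "struct<%s>" % cols
        if avro_type["type"] == "array":
            return "array<%s>" % infer_hive_type(avro_type["items"])
    return {"string": "string", "long": "bigint", "boolean": "boolean"}[avro_type]

# Cut-down version of the header field from the schema literal above.
writer_schema = {
    "type": "record", "name": "Messages",
    "fields": [
        {"name": "header",
         "type": ["null", {"type": "record", "name": "HeaderInfo",
                           "fields": [{"name": "event_type",
                                       "type": ["null", "string"],
                                       "default": None}]}]},
    ],
}

# The inferred column type keeps no trace of the outer ["null", ...] union:
print(infer_hive_type(writer_schema))
# -> struct<header:struct<event_type:string>>
```

The point is that the mapping is one-way: nullability survives (Hive columns are always nullable), but the Avro-level union structure that the writer serialized against does not, so a reader schema rebuilt from the Hive columns no longer resolves against the written data.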
This has implications for systems where data may be produced externally to Hive. It also has repercussions for table replication using Falcon/GDM, in that the schema file/literal needs to be replicated as well.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)