Mithun Radhakrishnan created HIVE-14789:
-------------------------------------------

             Summary: Avro table reads break when using the SerDe-generated 
table schema.
                 Key: HIVE-14789
                 URL: https://issues.apache.org/jira/browse/HIVE-14789
             Project: Hive
          Issue Type: Bug
          Components: Serializers/Deserializers
    Affects Versions: 2.0.1, 1.2.1
            Reporter: Mithun Radhakrishnan


The AvroSerDe allows one to skip the column list in a table definition when 
creating a table, as long as the TBLPROPERTIES includes a valid 
{{avro.schema.url}} or {{avro.schema.literal}}. The table columns are then 
inferred by processing the Avro schema file/literal.

The problem is that the inferred schema might not be congruent with the actual 
schema in the Avro schema file/literal. Consider the following table definition:

{code:sql}
CREATE TABLE avro_schema_break_1
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
  "type": "record",
  "name": "Messages",
  "namespace": "net.myth",
  "fields": [
    {
      "name": "header",
      "type": [
        "null",
        {
          "type": "record",
          "name": "HeaderInfo",
          "fields": [
            {
              "name": "inferred_event_type",
              "type": [
                "null",
                "string"
              ],
              "default": null
            },
            {
              "name": "event_type",
              "type": [
                "null",
                "string"
              ],
              "default": null
            },
            {
              "name": "event_version",
              "type": [
                "null",
                "string"
              ],
              "default": null
            }    
          ]
        }
      ]
    },
    {
      "name": "messages",
      "type": {
        "type": "array",
        "items": {
          "name": "MessageInfo",
          "type": "record",
          "fields": [
            {
              "name": "message_id",
              "type": [
                "null",
                "string"
              ],
              "doc": "Message-ID"
            },
            {
              "name": "received_date",
              "type": [
                "null",
                "long"
              ],
              "doc": "Received Date"
            },
            {
              "name": "sent_date",
              "type": [
                "null",
                "long"
              ]
            },
            {
              "name": "from_name",
              "type": [
                "null",
                "string"
              ]
            },
            {
              "name": "flags",
              "type": [
                "null",
                {
                  "type": "record",
                  "name": "Flags",
                  "fields": [
                    {
                      "name": "is_seen",
                      "type": [
                        "null",
                        "boolean"
                      ],
                      "default": null
                    },
                    {
                      "name": "is_read",
                      "type": [
                        "null",
                        "boolean"
                      ],
                      "default": null
                    },
                    {
                      "name": "is_flagged",
                      "type": [
                        "null",
                        "boolean"
                      ],
                      "default": null
                    }
                  ]
                }
              ],
              "default": null
            }
          ]
        }
      }
    }
  ]
}');
{code}

This produces a table with the following schema:
{noformat}
2016-09-19T13:23:42,934 DEBUG [0ce7e586-13ea-4390-ac2a-6dac36e8a216 main] hive.log: DDL:
struct avro_schema_break_1 {
  struct<inferred_event_type:string,event_type:string,event_version:string> header,
  list<struct<message_id:string,received_date:i64,sent_date:i64,from_name:string,flags:struct<is_seen:bool,is_read:bool,is_flagged:bool>>> messages
}
{noformat}
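
For reference, the inferred columns correspond to a table that could have been 
declared explicitly as below (a sketch, transcribed by hand from the DDL dump 
above). Note that the Hive types retain neither the Avro record names (e.g. 
{{net.myth.HeaderInfo}}) nor the nullable unions, which is presumably why a 
schema regenerated from these types fails to match the writer's union branches:

{code:sql}
-- Hypothetical explicit equivalent of the inferred table schema,
-- transcribed from the "DDL:" log line above. The nullable Avro unions
-- ("null", T) have collapsed into plain struct fields, and the record
-- names (net.myth.HeaderInfo, etc.) are not representable at all.
CREATE TABLE avro_schema_break_1_explicit (
  header   STRUCT<inferred_event_type:STRING,
                  event_type:STRING,
                  event_version:STRING>,
  messages ARRAY<STRUCT<message_id:STRING,
                        received_date:BIGINT,
                        sent_date:BIGINT,
                        from_name:STRING,
                        flags:STRUCT<is_seen:BOOLEAN,
                                     is_read:BOOLEAN,
                                     is_flagged:BOOLEAN>>>
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';
{code}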

Data written to this table with the Avro schema from {{avro.schema.literal}} 
(e.g. via Pig's {{AvroStorage}}) cannot then be read through Hive with the 
generated table schema. This is the exception one sees:

{noformat}
java.io.IOException: org.apache.avro.AvroTypeException: Found net.myth.HeaderInfo, expecting union
  at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:521)
  at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
  at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147)
  at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2019)
  at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:253)
  at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
  at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:400)
  at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
  at org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:1162)
  at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:1136)
  at org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:172)
  at org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:104)
  at org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver(TestCliDriver.java:59)
...
{noformat}

The only way to read this table is by keeping the original 
{{avro.schema.literal}} or {{avro.schema.url}} attached to it. This has 
implications for systems where data might be produced externally to Hive. It 
also has repercussions for table replication using Falcon/GDM, in that the 
schema file/literal needs to be replicated along with the table.
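
One mitigation sketch (untested; the URL below is a placeholder) is to 
re-attach the original writer schema to the table, so that the AvroSerDe 
resolves reads against it rather than against a schema regenerated from the 
Hive column types:

{code:sql}
-- Hypothetical workaround: point the table back at the original Avro
-- schema. The HDFS path is a placeholder; any reachable copy of the
-- writer's .avsc would do.
ALTER TABLE avro_schema_break_1
SET TBLPROPERTIES ('avro.schema.url'='hdfs:///path/to/Messages.avsc');
{code}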


