The Hive mailing list would have more info on the Avro SerDe usage.
In general, a system without union types, such as Hive or Pig, has to
expand a union into multiple fields if there is more than one non-null
type. At most one branch of the union is then not null.
For example, a record with fields:
{"name":"timestamp", "type":"long", "default":-1}
{"name":"ipAddress", "type":["IPv4", "IPv6"]}
where IPv4 and IPv6 are previously defined types, would have to expand to
three fields:
"timestamp", "ipAddress:IPv4", and "ipAddress:IPv6", where only one of
the last two is not null in any given record.
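A minimal sketch of that expansion in Python, for illustration only. It is
not what Hive's Avro SerDe does. The helper names (branch_name,
expand_columns, expand_row) and the (branch name, value) pair used to tag a
union value are assumptions made up for this example:

```python
def branch_name(branch):
    """Name of a union branch: a type name string, or a named/complex type dict."""
    if isinstance(branch, str):
        return branch
    return branch.get("name", branch["type"])


def non_null_branches(field_type):
    """Non-null branches of a union; an empty list if the type is not a union."""
    if not isinstance(field_type, list):
        return []
    return [b for b in field_type if b != "null"]


def expand_columns(fields):
    """One column per field, or one per non-null branch for a multi-branch union."""
    columns = []
    for field in fields:
        branches = non_null_branches(field["type"])
        if len(branches) > 1:
            columns.extend(f"{field['name']}:{branch_name(b)}" for b in branches)
        else:
            columns.append(field["name"])
    return columns


def expand_row(fields, record):
    """Flatten one record; exactly one expanded union column receives the value."""
    row = {}
    for field in fields:
        name = field["name"]
        branches = non_null_branches(field["type"])
        if len(branches) > 1:
            # Assumption: a union value arrives as a (branch name, value) pair.
            chosen, value = record[name]
            for b in branches:
                column = f"{name}:{branch_name(b)}"
                row[column] = value if branch_name(b) == chosen else None
        else:
            row[name] = record[name]
    return row


fields = [
    {"name": "timestamp", "type": "long", "default": -1},
    {"name": "ipAddress", "type": ["IPv4", "IPv6"]},
]
columns = expand_columns(fields)
row = expand_row(fields, {"timestamp": 1369318500, "ipAddress": ("IPv4", "10.0.0.1")})
print(columns)
print(row)
```

A union such as ["null", "string"] has only one non-null branch, so this
sketch keeps it as a single nullable column instead of expanding it.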
I do not know what Hive's Avro SerDe does with unions.
On 5/23/13 7:15 AM, "Ran S" <[email protected]> wrote:
>Hi,
>We started to work with Avro in CDH4 and to query the Avro files using
>Hive.
>This does work fine for us, except for unions.
>We do not understand how to query the data inside a union using Hive.
>
>For example, let's look at the following schema:
>
>{
> "type":"record",
> "name":"event",
> "namespace":"com.mysite",
> "fields":[
> {
> "name":"header",
> "type":{
> "type":"record", "name":"CommonHeader",
> "fields":[{ "name":"eventTimeStamp", "type":"long", "default":-1
>},
> { "name":"globalUserId", "type":["null", "string"],
>"default":null } ]
> },
> "default":null
> },
> {
> "name":"eventbody",
> "type":{
> "type":"record", "name":"eventbody",
> "fields":[
> {
> "name":"body",
> "type":[
> "null",
> {
> "type":"record",
> "name":"event1",
> "fields":[
> {
> "name":"event1Header",
> "type":["null", { "type":"array",
>"items":"string" }], "default":null
> },
> {
> "name":"event1Body",
> "type":["null", { "type":"array",
>"items":"string" }], "default":null
> }
> ]
> },
> {
> "type":"record",
> "name":"event2",
> "fields":[
> {
> "name":"page",
> "type":{
> "type":"record", "name":"URL",
>"fields":[{ "name":"url", "type":"string" }]
> },
> "default":null
> },
> {
> "name":"referrer", "type":"string",
>"default":null
> }
> ]
> }
> ],
> "default":null
> }
> ]
> },
> "default":null
> }
>]}
>
>Note that "body" is a union of three types:
>null, "event1" and "event2"
>
>So if I want to query fields inside event1, I first need to access it.
>I then write a HiveQL query like this:
>SELECT eventbody.body.??? from SRC
>
>My question is: what should I put in the ??? above to make this work?
>
>Thank you,
>Ran