The fix was this:
{
  "type": "record",
  "name": "Email",
  "fields": [
    {"name": "message_id", "type": ["null", "string"], "doc": ""},
    {"name": "in_reply_to", "type": ["string", "null"]},
    {"name": "subject", "type": ["string", "null"]},
    {"name": "body", "type": ["string", "null"]},
    {"name": "date", "type": ["string", "null"]},
    {
      "name": "froms",
      "type": [
        "null",
        {
          "type": "array",
          "items": [
            "null",
            {
              "type": "record",
              "name": "from",
              "fields": [
                {"name": "real_name", "type": ["null", "string"], "doc": ""},
                {"name": "address", "type": ["null", "string"], "doc": ""}
              ]
            }
          ]
        }
      ],
      "doc": ""
    },
    {
      "name": "tos",
      "type": [
        "null",
        {
          "type": "array",
          "items": [
            "null",
            {
              "type": "record",
              "name": "to",
              "fields": [
                {"name": "real_name", "type": ["null", "string"], "doc": ""},
                {"name": "address", "type": ["null", "string"], "doc": ""}
              ]
            }
          ]
        }
      ],
      "doc": ""
    },
    {
      "name": "ccs",
      "type": [
        "null",
        {
          "type": "array",
          "items": [
            "null",
            {
              "type": "record",
              "name": "cc",
              "fields": [
                {"name": "real_name", "type": ["null", "string"], "doc": ""},
                {"name": "address", "type": ["null", "string"], "doc": ""}
              ]
            }
          ]
        }
      ],
      "doc": ""
    },
    {
      "name": "bccs",
      "type": [
        "null",
        {
          "type": "array",
          "items": [
            "null",
            {
              "type": "record",
              "name": "bcc",
              "fields": [
                {"name": "real_name", "type": ["null", "string"], "doc": ""},
                {"name": "address", "type": ["null", "string"], "doc": ""}
              ]
            }
          ]
        }
      ],
      "doc": ""
    },
    {
      "name": "reply_tos",
      "type": [
        "null",
        {
          "type": "array",
          "items": [
            "null",
            {
              "type": "record",
              "name": "reply_to",
              "fields": [
                {"name": "real_name", "type": ["null", "string"], "doc": ""},
                {"name": "address", "type": ["null", "string"], "doc": ""}
              ]
            }
          ]
        }
      ],
      "doc": ""
    }
  ]
}
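The five address fields in the schema above all share one shape: a nullable array of nullable records, each with real_name and address. As a rough sketch only (the helper names address_list_field, nullable_string and build_email_schema are made up for illustration and are not part of Avro or Pig), the same schema could be generated in Python rather than written out by hand:

```python
import json


def address_list_field(field_name, record_name):
    """Build one nullable array of nullable address records (hypothetical helper)."""
    address_record = {
        "type": "record",
        "name": record_name,
        "fields": [
            {"name": "real_name", "type": ["null", "string"], "doc": ""},
            {"name": "address", "type": ["null", "string"], "doc": ""},
        ],
    }
    return {
        "name": field_name,
        "type": ["null", {"type": "array", "items": ["null", address_record]}],
        "doc": "",
    }


def nullable_string(field_name):
    """Build a plain nullable string field, as used for subject, body and date."""
    return {"name": field_name, "type": ["string", "null"]}


def build_email_schema():
    """Assemble the Email record in the same field order as the schema above."""
    fields = [{"name": "message_id", "type": ["null", "string"], "doc": ""}]
    fields += [nullable_string(n) for n in ("in_reply_to", "subject", "body", "date")]
    fields += [
        address_list_field(plural, singular)
        for plural, singular in [
            ("froms", "from"),
            ("tos", "to"),
            ("ccs", "cc"),
            ("bccs", "bcc"),
            ("reply_tos", "reply_to"),
        ]
    ]
    return {"type": "record", "name": "Email", "fields": fields}


if __name__ == "__main__":
    print(json.dumps(build_email_schema(), indent=2))
```

The point of the sketch is only that the plural field name and the singular record name are the two things that vary per field; everything else is boilerplate.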
On Tue, Apr 10, 2012 at 2:36 AM, Russell Jurney <[email protected]>
wrote:
Hmmmm unable to get this to work:
{
"namespace": "agile.data.avro",
"name": "Email",
"type": "record",
"fields": [
{"name":"message_id", "type": ["string", "null"]},
{"name":"froms","type": [{"type":"record", "name":"from", "fields":
[{"type":"array", "items":"string"}, "null"]}, "null"]},
{"name":"tos","type": [{"type":"record", "name":"to", "fields":
[{"type":"array", "items":"string"}, "null"]}, "null"]},
{"name":"ccs","type": [{"type":"record", "name":"cc", "fields":
[{"type":"array", "items":"string"}, "null"]}, "null"]},
{"name":"bccs","type": [{"type":"record", "name":"bcc", "fields":
[{"type":"array", "items":"string"}, "null"]}, "null"]},
{"name":"reply_tos","type": [{"type":"record", "name":"reply_to",
"fields": [{"type":"array", "items":"string"}, "null"]}, "null"]},
{"name":"in_reply_to", "type": [{"type":"array", "items":"string"},
"null"]},
{"name":"subject", "type": ["string", "null"]},
{"name":"body", "type": ["string", "null"]},
{"name":"date", "type": ["string", "null"]}
]
}
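A likely reason this attempt fails (an editorial reading; the thread does not quote the actual error): in Avro, every entry in a record's "fields" list must be a JSON object with its own "name" and "type". In the schema above, the inner records list a bare array type and the string "null" as if they were fields. The following is a small stdlib-only sketch, not the Avro parser, and find_bad_fields is a hypothetical helper name; it just walks a schema and reports such entries:

```python
def find_bad_fields(schema, path="Email"):
    """Return locations of record 'fields' entries that lack a name or a type."""
    problems = []
    if isinstance(schema, list):
        # A union: check every branch.
        for branch in schema:
            problems += find_bad_fields(branch, path)
    elif isinstance(schema, dict):
        if schema.get("type") == "record":
            record_path = schema.get("name", path)
            for i, field in enumerate(schema.get("fields", [])):
                where = f"{record_path}.fields[{i}]"
                if not isinstance(field, dict) or "name" not in field or "type" not in field:
                    problems.append(where)
                else:
                    problems += find_bad_fields(
                        field["type"], f"{record_path}.{field['name']}"
                    )
        elif schema.get("type") == "array":
            problems += find_bad_fields(schema.get("items"), path)
    return problems


# A cut-down copy of the attempt above: message_id plus the froms field.
attempt = {
    "name": "Email",
    "type": "record",
    "fields": [
        {"name": "message_id", "type": ["string", "null"]},
        {
            "name": "froms",
            "type": [
                {
                    "type": "record",
                    "name": "from",
                    "fields": [{"type": "array", "items": "string"}, "null"],
                },
                "null",
            ],
        },
    ],
}

if __name__ == "__main__":
    print(find_bad_fields(attempt))
```

Both entries of the inner "from" record are flagged: the first has a type but no name, and the second is not an object at all.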
On Tue, Apr 10, 2012 at 2:26 AM, Russell Jurney <[email protected]>
wrote:
In thinking about it more... it seems that unfortunately, the only thing I can
really do is to change the schema for all email address fields:
{"name":"from","type": [{"type":"array", "items":"string"}, "null"]},
to:
{"name":"froms","type": [{"type":"record", "name":"from", "fields":
[{"type":"array", "items":"string"}, "null"]}, "null"]},
That is, to pluralize everything and then individually name array elements. I
will try running this through my stack.
On Mon, Apr 2, 2012 at 9:13 AM, Scott Carey <[email protected]> wrote:
It appears that AvroStorage's Avro-to-Pig schema translation names the elements
of every array ARRAY_ELEM (in Pig). The nullable wrapper is 'visible' and the
field name is not moved onto the bag name.
About a year and a half ago I started
https://issues.apache.org/jira/browse/AVRO-592
but before finishing it AvroStorage was written elsewhere. I don't recall
exactly what I did with the schema translation there, but I recall the mapping
from an Avro schema to pig tried to hide the nullable wrappers more.
In Avro, arrays are unnamed types, so I see two things you could probably do
without any code changes:
* Add a line in the pig script to project / rename the fields to what you want
(unfortunate and clumsy, but I think it will work). I think you want
"from::PIG_WRAPPER::ARRAY_ELEM as from" or
"FLATTEN(from::PIG_WRAPPER)::ARRAY_ELEM as from", or something like that.
* Add a record wrapper to your schema (which may inject more messiness in the
pig schema view):
{
"namespace": "agile.data.avro",
"name": "Email",
"type": "record",
"fields": [
{"name":"message_id", "type": ["string", "null"]},
{"name":"from","type": [{"type":"record", "name":"From", "fields":
[[{"type":"array", "items":"string"},"null"]], "null"]},
…
]
}
But that is very awkward — requiring a named record for each field that is an
unnamed type.
Ideally AvroStorage would treat any union of null and one other thing as a
simple pig type with no wrapper, and project the name of a field or record into
the name of the thing inside a bag.
-Scott
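The behaviour Scott describes as ideal can be illustrated with a short Python sketch. This is not AvroStorage code (which is Java) and makes no claim about how it is implemented; unwrap_nullable and describe_field are hypothetical names. It only shows the two ideas: collapse a union of "null" and one other type to that other type, and reuse the field's name for the array's elements.

```python
def unwrap_nullable(avro_type):
    """Collapse a union of "null" and exactly one other branch to that branch."""
    if isinstance(avro_type, list) and len(avro_type) == 2 and "null" in avro_type:
        others = [t for t in avro_type if t != "null"]
        if len(others) == 1:
            return others[0]
    # Anything else (plain types, larger unions) is returned unchanged.
    return avro_type


def describe_field(field):
    """Return (field name, unwrapped type, element name or None).

    For array fields the element name is taken from the field name,
    instead of a generic placeholder such as ARRAY_ELEM.
    """
    inner = unwrap_nullable(field["type"])
    if isinstance(inner, dict) and inner.get("type") == "array":
        return field["name"], unwrap_nullable(inner["items"]), field["name"]
    return field["name"], inner, None


if __name__ == "__main__":
    from_field = {"name": "from", "type": [{"type": "array", "items": "string"}, "null"]}
    subject_field = {"name": "subject", "type": ["string", "null"]}
    print(describe_field(from_field))
    print(describe_field(subject_field))
```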
On 3/29/12 6:05 PM, "Russell Jurney" <[email protected]> wrote:
Is it possible to name string elements in the schema of an array?
Specifically, below I want to name the email addresses in the
from/to/cc/bcc/reply_to fields, so they don't get auto-named ARRAY_ELEM by
Pig's AvroStorage. I know I can probably fix this in Java in the Pig
AvroStorage UDF, but I'm hoping I can also fix it more easily in the schema.
Last time I read Avro's array docs in this context, my hit-points dropped by a
third, so pardon me if I've not rtfm this time :)
Complete description of what I'm doing follows:
Avro schema for my emails:
{
"namespace": "agile.data.avro",
"name": "Email",
"type": "record",
"fields": [
{"name":"message_id", "type": ["string", "null"]},
{"name":"from","type": [{"type":"array", "items":"string"}, "null"]},
{"name":"to","type": [{"type":"array", "items":"string"}, "null"]},
{"name":"cc","type": [{"type":"array", "items":"string"}, "null"]},
{"name":"bcc","type": [{"type":"array", "items":"string"}, "null"]},
{"name":"reply_to", "type": [{"type":"array", "items":"string"},
"null"]},
{"name":"in_reply_to", "type": [{"type":"array", "items":"string"},
"null"]},
{"name":"subject", "type": ["string", "null"]},
{"name":"body", "type": ["string", "null"]},
{"name":"date", "type": ["string", "null"]}
]
}
Pig to publish my Avros:
grunt> emails = load '/me/tmp/emails' using AvroStorage();
grunt> describe emails
emails: {message_id: chararray,from: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},to:
{PIG_WRAPPER: (ARRAY_ELEM: chararray)},cc: {PIG_WRAPPER: (ARRAY_ELEM:
chararray)},bcc: {PIG_WRAPPER: (ARRAY_ELEM: chararray)},reply_to: {PIG_WRAPPER:
(ARRAY_ELEM: chararray)},in_reply_to: {PIG_WRAPPER: (ARRAY_ELEM:
chararray)},subject: chararray,body: chararray,date: chararray}
grunt> store emails into 'mongodb://localhost/agile_data.emails' using
MongoStorage();
My emails in MongoDB:
> db.emails.findOne()
{
"_id" : ObjectId("4f738a35414e113e75707b97"),
"message_id" : "<[email protected]>",
"from" : [
{
"ARRAY_ELEM" : "[email protected]"
}
],
"to" : [
{
"ARRAY_ELEM" : "[email protected]"
}
],
"cc" : null,
"bcc" : null,
"reply_to" : null,
"in_reply_to" : null,
"subject" : "Daily Job Change Alerts from SalesLoft",
"body" : "Daily Job Change Alerts from SalesLoft",
"date" : "2012-03-27T08:00:29"
}
My email on screen:
My face when I see ARRAY_ELEM, because it means more complex presentation code:
:(
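For a sense of what that extra presentation code amounts to, here is a minimal Python sketch that unwraps the ARRAY_ELEM layer from a document shaped like the MongoDB record above. The function name strip_array_elem is hypothetical, and the addresses in the example are placeholders rather than values from the real data.

```python
def strip_array_elem(doc):
    """Return a copy of doc where [{"ARRAY_ELEM": x}, ...] lists become [x, ...]."""
    cleaned = {}
    for key, value in doc.items():
        if isinstance(value, list):
            cleaned[key] = [
                item["ARRAY_ELEM"]
                if isinstance(item, dict) and set(item) == {"ARRAY_ELEM"}
                else item
                for item in value
            ]
        else:
            # Scalars and nulls (None) pass through untouched.
            cleaned[key] = value
    return cleaned


if __name__ == "__main__":
    email = {
        "from": [{"ARRAY_ELEM": "sender@example.com"}],  # placeholder address
        "to": [{"ARRAY_ELEM": "recipient@example.com"}],  # placeholder address
        "cc": None,
        "subject": "Daily Job Change Alerts from SalesLoft",
    }
    print(strip_array_elem(email))
```

This works around the naming after the fact; it does not change what AvroStorage writes.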
--
Russell Jurney twitter.com/rjurney [email protected] datasyndrome.com