Re: Globbing several AVRO files with different (extended) schemes

Stan Rosenberg Wed, 21 Mar 2012 19:13:31 -0700

There is a patch for AvroStorage that computes a union schema, which
allows input Avro files with different schemas, specifically
(un-nested) records with different fields.


https://issues.apache.org/jira/browse/PIG-2579
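
To make the idea concrete, here is a rough sketch in Python of what merging flat record schemas could look like. It is not the code from the patch; the function name, the record name "Merged", and the rule of making missing fields nullable are assumptions chosen for illustration only.

```python
import json


def make_nullable(avro_type):
    """Return a union type that accepts null, with "null" as the first branch."""
    if isinstance(avro_type, list):
        # Already a union: move "null" to the front instead of nesting unions.
        return ["null"] + [t for t in avro_type if t != "null"]
    return ["null", avro_type]


def merge_flat_records(schemas, name="Merged"):
    """Merge flat Avro record schemas (given as dicts) into one record schema.

    Fields are matched by name. A field that is absent from at least one
    input schema becomes nullable with a null default. Two schemas that
    declare the same field with different types raise ValueError.
    """
    merged = {}   # field name -> type, in first-seen order
    seen_in = {}  # field name -> number of input schemas that contain it
    for schema in schemas:
        if schema.get("type") != "record":
            raise ValueError("only record schemas are handled")
        for field in schema["fields"]:
            fname, ftype = field["name"], field["type"]
            if fname in merged and merged[fname] != ftype:
                raise ValueError("conflicting types for field %r" % fname)
            merged.setdefault(fname, ftype)
            seen_in[fname] = seen_in.get(fname, 0) + 1

    fields = []
    for fname, ftype in merged.items():
        if seen_in[fname] < len(schemas):
            # Missing from some input: readers of those files get null.
            fields.append({"name": fname,
                           "type": make_nullable(ftype),
                           "default": None})
        else:
            fields.append({"name": fname, "type": ftype})
    return {"type": "record", "name": name, "fields": fields}


if __name__ == "__main__":
    # Hypothetical schemas: the second file gained a "referrer" field.
    old = {"type": "record", "name": "Event",
           "fields": [{"name": "id", "type": "long"},
                      {"name": "url", "type": "string"}]}
    new = {"type": "record", "name": "Event",
           "fields": [{"name": "id", "type": "long"},
                      {"name": "url", "type": "string"},
                      {"name": "referrer", "type": "string"}]}
    print(json.dumps(merge_flat_records([old, new]), indent=2))
```

Nested records, type promotion (e.g. int to long) and field aliases are deliberately left out of this sketch.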

Best,

stan

On Wed, Mar 21, 2012 at 8:31 PM, Jonathan Coveney <[email protected]> wrote:
> A question about this: does Avro have clear cut rules for how to
> essentially merge two arbitrary JSON schemas?
>
> 2012/3/21 Jonathan Coveney <[email protected]>
>
>> ATM, there is no quick and easy solution short of patching Pig... feel
>> free to make a ticket.
>>
>> Short of that, what you can do is load each set of files that shares a
>> schema as a separate relation, and then union them. Given that there might be a
>> lot of different relations and schemas involved, you could probably make a
>> script to do this for you... but yeah, the long term approach is to patch
>> AvroStorage.
>>
>>
>> 2012/3/21 Markus Resch <[email protected]>
>>
>>> Hi guys,
>>>
>>> Thanks again for your awesome hint about sqoop.
>>>
>>> I have another question: The data I'm working with is stored as Avro
>>> files in Hadoop. When I glob them, everything works perfectly. But
>>> when I add something to the schema of a single data file while the
>>> others remain unchanged, everything breaks:
>>>
>>> "currently we assume all avro files under the same "location"
>>>     * share the same schema and will throw exception if not."
>>>
>>> (e.g. I add a new data field.) The behavior I would expect is: if I
>>> glob several files with slightly different schemas, the LOAD either
>>> returns the intersection of the fields common to all schemas, or
>>> keeps all fields and sets the missing ones to null.
>>>
>>> How could I handle this properly?
>>>
>>> Thanks
>>>
>>> Markus
>>>
>>>
>>>
>>
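
For the workaround quoted above (load each schema's files separately, then union them), a small generator script might look like the following. This is only a sketch: the paths are made up, the helper name is hypothetical, and the choice of UNION ONSCHEMA (which matches fields by name and fills missing ones with null) is my assumption about what "do a union" would mean in practice. The REGISTER statements for piggybank and the Avro jars are omitted.

```python
def pig_union_script(paths,
                     loader="org.apache.pig.piggybank.storage.avro.AvroStorage()"):
    """Build a Pig Latin script that loads each path as its own relation
    and combines the relations with UNION ONSCHEMA.

    `paths` should contain one entry per group of files sharing a schema.
    """
    if len(paths) < 2:
        raise ValueError("need at least two paths to build a union")

    lines = []
    aliases = []
    for i, path in enumerate(paths):
        alias = "rel%d" % i
        aliases.append(alias)
        lines.append("%s = LOAD '%s' USING %s;" % (alias, path, loader))

    # ONSCHEMA requires every input relation to have a known schema,
    # which the loader is expected to provide from the Avro files.
    lines.append("merged = UNION ONSCHEMA %s;" % ", ".join(aliases))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical locations: one directory per schema version.
    print(pig_union_script(["/data/events/v1", "/data/events/v2"]))
```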
