Re: Writing an array of multiple different Record types to Avro format, into the same file

Horváth Péter Gergely Wed, 20 Jun 2018 08:48:02 -0700

Hi Nandor,

Yes, thank you, that works perfectly: even after deserialization, the class
of the subtype is correctly restored. Instead of using subtypes, we now
have a consistent wrapper for everything: that is SMART! :)
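
For anyone finding this thread later, here is a minimal sketch of the wrapper
idea on the reading side. It is a plain-Java stand-in that does not use the
Avro library: the field names follow the schema quoted below, while
SubtypeDispatch and describe are hypothetical names made up for illustration.
With Avro itself, the subtype value would come back from the reader as
whichever branch of the union was written.

```java
// Plain-Java stand-ins for the two branches of the "subtype" union.
// Field names follow the schema in this thread; the classes are illustrative only.
final class RecordTypeA {
    final Long integerSpecificToA1;    // ["null", "long"] in the schema, so nullable
    final String stringSpecificToA1;   // ["null", "string"] in the schema, so nullable

    RecordTypeA(Long integerSpecificToA1, String stringSpecificToA1) {
        this.integerSpecificToA1 = integerSpecificToA1;
        this.stringSpecificToA1 = stringSpecificToA1;
    }
}

final class RecordTypeB {
    final Boolean booleanSpecificToB1; // ["null", "boolean"] in the schema
    final String stringSpecificToB1;   // ["null", "string"] in the schema

    RecordTypeB(Boolean booleanSpecificToB1, String stringSpecificToB1) {
        this.booleanSpecificToB1 = booleanSpecificToB1;
        this.stringSpecificToB1 = stringSpecificToB1;
    }
}

// The wrapper: the common fields plus exactly one subtype object.
final class RecordWithCommonFields {
    final String commonField1;
    final String commonField2;
    final Object subtype; // RecordTypeA or RecordTypeB, never null (non-nullable union)

    RecordWithCommonFields(String commonField1, String commonField2, Object subtype) {
        // instanceof is false for null, so this also rejects a missing subtype.
        if (!(subtype instanceof RecordTypeA) && !(subtype instanceof RecordTypeB)) {
            throw new IllegalArgumentException("subtype must be RecordTypeA or RecordTypeB");
        }
        this.commonField1 = commonField1;
        this.commonField2 = commonField2;
        this.subtype = subtype;
    }
}

// Hypothetical helper: the reader branches on the runtime class of the subtype.
final class SubtypeDispatch {
    static String describe(RecordWithCommonFields r) {
        if (r.subtype instanceof RecordTypeA) {
            RecordTypeA a = (RecordTypeA) r.subtype;
            return "A:" + a.integerSpecificToA1 + "," + a.stringSpecificToA1;
        }
        // The constructor guarantees the only other possibility is RecordTypeB.
        RecordTypeB b = (RecordTypeB) r.subtype;
        return "B:" + b.booleanSpecificToB1 + "," + b.stringSpecificToB1;
    }
}
```

The point of the design is that no separate discriminator field is needed:
the branch of the union that was written plays that role.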


This may be an edge case, but it might be worth adding a short hint about
it to the official documentation.

Once again, thank you for your help.

Regards,
Peter



2018-06-20 16:56 GMT+02:00 Nandor Kollar <[email protected]>:

> No, I was thinking of something like this:
>
> {
>   "namespace": "com.foobar",
>   "name": "UnionRecords",
>   "type": "array",
>   "items": {
>     "type": "record",
>     "name": "RecordWithCommonFields",
>     "fields": [
>       {"name": "commonField1", "type": "string"},
>       {"name": "commonField2", "type": "string"},
>       {"name": "subtype", "type": [
>         {
>           "type" : "record",
>           "name": "RecordTypeA",
>           "fields" : [
>             {"name": "integerSpecificToA1", "type": ["null", "long"] },
>             {"name": "stringSpecificToA1", "type": ["null", "string"]}
>           ]
>         },
>         {
>           "type" : "record",
>           "name": "RecordTypeB",
>           "fields" : [
>             {"name": "booleanSpecificToB1", "type": ["null", "boolean"]},
>             {"name": "stringSpecificToB1", "type": ["null", "string"]}
>           ]
>         }
>       ]}
>     ]
>   }
> }
>
> This schema represents an array of records. Each record has two mandatory
> fields (the common fields) and one field for the subtypes. The latter
> (named subtype) is a mandatory, non-nullable union of the RecordTypeA and
> RecordTypeB records, each of which holds the fields specific to that
> subtype. Does this solve your use case?
>
> Regards,
> Nandor
>
> On Wed, Jun 20, 2018 at 4:34 PM, Horváth Péter Gergely <
> [email protected]> wrote:
>
>> Hi Nandor,
>>
>> Thank you for your suggestion. Do you mean something like this:
>>
>> [
>>   {
>>     "namespace": "com.foobar",
>>     "name": "UnionRecords",
>>     "type": "array",
>>     "items": {
>>       "type": "record",
>>       "name": "UnionRecord",
>>       "fields": [
>>         {"name": "commonField1", "type": "string"},
>>         {"name": "commonField2", "type": "string"},
>>         {"name": "integerSpecificToA1", "type": ["null", "long"] },
>>         {"name": "stringSpecificToA1", "type": ["null", "string"]},
>>         {"name": "booleanSpecificToB1", "type": ["null", "boolean"]},
>>         {"name": "stringSpecificToB1", "type": ["null", "string"]}
>>       ]
>>     }
>>   }
>> ]
>>
>> How would you make the distinction when the record is being read? That is
>> not clear to me. Could you please clarify that?
>>
>> Thanks,
>> Peter
>>
>>
>>
>> 2018-06-20 15:51 GMT+02:00 Nandor Kollar <[email protected]>:
>>
>>> Hi Peter,
>>>
>>> I think what you need is a union
>>> <https://avro.apache.org/docs/1.8.1/spec.html#Unions> of records. What
>>> comes to my mind is to create a record type with these fields: all common
>>> fields (commonField1, commonField2) and an additional union field for
>>> the derived types (a non-nullable union, since your base class is
>>> abstract). The union is a union of your concrete records: RecordTypeA
>>> and RecordTypeB, each with the fields specific only to that derived
>>> type.
>>>
>>> Regards,
>>> Nandor
>>>
>>> On Wed, Jun 20, 2018 at 3:35 PM, Horváth Péter Gergely <
>>> [email protected]> wrote:
>>>
>>>> Hi All,
>>>>
>>>> We have a legacy file format that I need to migrate to Avro
>>>> format. The tricky part is that the records basically have
>>>>
>>>>    - some common fields,
>>>>    - a discriminator field and
>>>>    - some unique fields, specific to the type selected by the
>>>>    discriminator field
>>>>
>>>> All records are stored in the same file, in no particular order, mixed
>>>> with each other.
>>>>
>>>> In Java/object-oriented terms, one could represent our record
>>>> concept as follows:
>>>>
>>>> abstract class RecordWithCommonFields {
>>>>    private Long commonField1;
>>>>    private String commonField2;
>>>>    ...
>>>> }
>>>>
>>>> class RecordTypeA extends RecordWithCommonFields {
>>>>    private Integer integerSpecificToA1;
>>>>    private String stringSpecificToA1;
>>>>    ...
>>>> }
>>>>
>>>> class RecordTypeB extends RecordWithCommonFields {
>>>>    private Boolean booleanSpecificToB1;
>>>>    private String stringSpecificToB1;
>>>>    ...
>>>> }
>>>>
>>>> Imagine the data being something like this:
>>>>
>>>> commonField1Value;commonField2Value,TYPE_IS_A,specificToA1Value,specificToA1Value
>>>> commonField1Value;commonField2Value,TYPE_IS_B,specificToB1Value,specificToB1Value
>>>>
>>>> So I would like to process an incoming file and write its content to
>>>> Avro format, somehow representing the different types of the records:
>>>> technically this would be an array, which should hold different types of
>>>> records.
>>>>
>>>> Can someone give me some ideas on how to achieve this?
>>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>>
>>>
>>
>
