I think the specification is clear about that.

Unions
> A union is encoded by first writing a long value indicating the
> zero-based position within the union of the schema of its value. The value
> is then encoded per the indicated schema within the union.
> For example, the union schema ["string","null"] would encode:
>
>    - null as the integer 1 (the index of "null" in the union, encoded as
>    hex 02):
>
>    02
>
>    - the string "a" as zero (the index of "string" in the union),
>    followed by the serialized string:
>
>    00 02 61
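
To make the size of that overhead concrete, here is a minimal hand-rolled sketch of the encoding quoted above. It does not use the Avro library; the helper names (zigzag, encode_long, encode_string, encode_nullable_string) are made up for illustration, and it assumes the two-branch union ["string","null"] from the spec example.

```python
def zigzag(n: int) -> int:
    """Zig-zag map a signed 64-bit integer to an unsigned one."""
    return (n << 1) ^ (n >> 63)


def encode_long(n: int) -> bytes:
    """Encode a long as a zig-zag, variable-length integer."""
    z = zigzag(n)
    out = bytearray()
    while z > 0x7F:
        # Emit the low 7 bits and set the continuation bit.
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Encode a plain "string": byte length as a long, then UTF-8 data."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_nullable_string(value, branches=("string", "null")) -> bytes:
    """Encode a value of the union ["string","null"] (hypothetical helper).

    The branch index is written first as a long; null has no payload.
    """
    if value is None:
        return encode_long(branches.index("null"))
    return encode_long(branches.index("string")) + encode_string(value)


if __name__ == "__main__":
    print(encode_nullable_string(None).hex())  # 02
    print(encode_nullable_string("a").hex())   # 000261
    print(encode_string("a").hex())            # 0261
```

With this sketch, the union adds exactly one byte (the branch index) in front of every non-null string, and a null costs one byte in total.
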
http://avro.apache.org/docs/1.7.6/spec.html

So there is an overhead (the branch index, one byte per value for a two-branch union like this one), but that may not be the main issue. The issue might be more about defining a correct schema. If a field can be null, then all clients should handle the case where the field is indeed null. That's a 'hygiene issue' (or a data quality issue, if you prefer), like with a database schema.

Regards

Bertrand

Bertrand Dechoux

On Fri, Mar 14, 2014 at 9:15 AM, Fengyun RAO <[email protected]> wrote:

> I have some string fields which may be null, while some definitely not
> null.
> The problem is that it takes time to distinguish them.
> There are about 100 fields, 50 of which are string, 10 of which I guess
> could be null.
>
> Could I just specify all string types ["string", "null"],
> how much is the efficiency difference?
