For what it's worth, for reasons including the tricky handling in dynamic
languages, I've taken to defining "unions" in the Thrift or Protocol Buffer
style.  Instead of "union(A,B,C,D)", I do "struct(union(null, A) a,
union(null, B) b, union(null, C) c, union(null, D) d)".  Note that this
implies certain storage inefficiencies.  I'm doing this in RPC-land, where
the extra few bytes don't bother me.
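
A minimal sketch of the two shapes, written as Avro schemas in Python
dicts.  The record names A and B, the wrapper name AOrB, and the helper
wrap() are placeholders for illustration, not taken from any real schema:

```python
import json

# Hypothetical record schemas; any two Avro records would do.
A = {"type": "record", "name": "A", "fields": [{"name": "x", "type": "int"}]}
B = {"type": "record", "name": "B", "fields": [{"name": "x", "type": "int"}]}

# Plain union: the encoder has to work out which branch a bare dict belongs to.
plain_union = [A, B]

# Struct of optionals: one nullable field per branch, so the field name
# itself says which branch is in use.
wrapper = {
    "type": "record",
    "name": "AOrB",
    "fields": [
        {"name": "a", "type": ["null", A], "default": None},
        {"name": "b", "type": ["null", B], "default": None},
    ],
}


def wrap(branch, value, schema=wrapper):
    """Build a wrapper datum with exactly one field set and the rest None."""
    names = [field["name"] for field in schema["fields"]]
    if branch not in names:
        raise KeyError(branch)
    return {name: (value if name == branch else None) for name in names}


datum = wrap("b", {"x": 1})
print(json.dumps(datum))
```

As far as I understand the binary encoding, each nullable field carries
its own one-byte branch index, so a four-branch wrapper spends roughly
three bytes more per datum than a plain union would.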

-- Philip


On Thu, Jun 5, 2014 at 11:00 AM, Grant Overby (groverby) <[email protected]> wrote:

>   Sure, but that is kind of an unbounded question. Can you be more
> specific as to what you’re looking for?
>
>  Here’s a shot at an answer:
> Polymorphism is a weak spot for Avro; unions help get around that
> shortcoming. We have unions which contain multiple record specifications.
> A field whose schema is a union can point to an instance of any one of
> several classes, and which class it is becomes known only at runtime.
>
>
>        *Grant Overby*
> Software Engineer
> Cisco.com
> [email protected]
> Mobile: *865 724 4910 <865%20724%204910>*
>
>
>
>        Think before you print.
>
> This email may contain confidential and privileged material for the sole
> use of the intended recipient. Any review, use, distribution or disclosure
> by others is strictly prohibited. If you are not the intended recipient (or
> authorized to receive for the recipient), please contact the sender by
> reply email and delete all copies of this message.
>
> Please click here
> <http://www.cisco.com/web/about/doing_business/legal/cri/index.html> for
> Company Registration Information.
>
>
>
>   From: Wai Yip Tung <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Thursday, June 5, 2014 at 1:40 PM
>
> To: "[email protected]" <[email protected]>
> Subject: Re: Union resolution in dynamic languages
>
>  That's good to know. Would you mind sharing your use case with us?
>
> Wai Yip
>
>    Grant Overby (groverby) <[email protected]>
> Thursday, June 05, 2014 6:46 AM
>   Disallowing multiple named types within a union would break our use
> cases.
>
>  We have a similar problem. With two record types in a union, the Python
> driver doesn’t choose well.
>
>  We solved this problem by adding a pseudo-reserved key to the dict to
> indicate which named type to use. I started the process of open sourcing
> that patch a few days ago. It’s definitely a hack, but I’m hoping the
> community will accept it.
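
A minimal sketch of how such a tag key could drive branch selection. The
key name "__type__" and the function pick_branch() are hypothetical; the
thread does not show the actual patch or name the key it uses:

```python
TYPE_KEY = "__type__"  # hypothetical key name, not the one used by the patch


def pick_branch(union, datum):
    """Return (index, schema) of the union branch named by datum[TYPE_KEY]."""
    wanted = datum.get(TYPE_KEY)
    for index, branch in enumerate(union):
        # Only named types (dict schemas with a "name") can match a tag.
        if isinstance(branch, dict) and branch.get("name") == wanted:
            return index, branch
    raise ValueError(f"no union branch named {wanted!r}")


union = [
    {"type": "record", "name": "Manager",
     "fields": [{"name": "name", "type": "string"}]},
    {"type": "record", "name": "NonManager",
     "fields": [{"name": "name", "type": "string"}]},
]

index, schema = pick_branch(union, {TYPE_KEY: "NonManager", "name": "Ann"})
print(index, schema["name"])
```

The scan here is linear in the number of branches; building a dict from
branch name to index once per schema would make the lookup constant time.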
>
>  Our patch doesn’t change the time complexity. From a brief glance,
> choosing within the union seems to typically be O(n), as the recursion
> short-circuits. For named types, the complexity could be O(1). O(1) for
> non-named types seems achievable too. How many projects are impacted by
> this ‘wasted’ complexity? Simpler code might be better than faster code.
>
>
>   From: Wai Yip Tung <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Wednesday, June 4, 2014 at 9:34 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: Union resolution in dynamic languages
>
>  Also, I ask about this in the context of building an optimized encoder.
> For this implementation, the resolution would be much simpler if we limited
> unions to not allow two records, similar to how the spec does not allow two
> array or two map types. I wonder if this limit would break any significant
> use case.
>
> Wai Yip
>
>    Wai Yip Tung <[email protected]>
> Wednesday, June 04, 2014 4:40 PM
>   For encoding data of a union type, the Avro specification does not say
> much about which one of the types in the union is used. So far I am mostly
> using unions so that I can write null or another simple type. In these
> cases, it is fairly easy for the encoder to distinguish null from other
> types.
>
> However, a union can also contain any named types, so its branches can be
> two records, say a Manager record and a NonManager record. I think with
> strongly typed languages, the suitable type in the union can be selected by
> introspection. But in dynamic languages, these might just be represented
> as maps without any notion of type. In some cases, we may find that the
> object has all the attributes of a NonManager but not those of a Manager,
> so we can conclude NonManager is the proper schema to use. But this can get
> complicated with nested data structures, where the attribute that can
> disambiguate things appears at a deeper level. Or you can think of valid
> scenarios where inspecting the content of the object cannot unambiguously
> resolve the union branch.
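
A minimal sketch of attribute-based resolution and of how it can fail.
The field layouts, the third record Contractor, and matching_branches()
are made up for illustration; real validation would also have to check
value types and defaults, which this deliberately ignores:

```python
def matching_branches(union, datum):
    """Return names of record branches whose field names equal the dict's keys."""
    keys = set(datum)
    return [
        branch["name"]
        for branch in union
        if {field["name"] for field in branch["fields"]} == keys
    ]


manager = {"type": "record", "name": "Manager", "fields": [
    {"name": "name", "type": "string"},
    {"name": "reports", "type": {"type": "array", "items": "string"}},
]}
non_manager = {"type": "record", "name": "NonManager", "fields": [
    {"name": "name", "type": "string"},
]}
# Hypothetical third record with the same shape as NonManager.
contractor = {"type": "record", "name": "Contractor", "fields": [
    {"name": "name", "type": "string"},
]}

datum = {"name": "Ann"}
resolved = matching_branches([manager, non_manager], datum)
ambiguous = matching_branches([manager, non_manager, contractor], datum)
print(resolved)
print(ambiguous)
```

With two structurally identical records in the union, the dict alone
matches more than one branch, so no amount of inspection can pick between
them.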
>
> I notice that the Python implementation uses two-pass recursive validation,
> possibly in order to resolve the union choice.
>
> I wonder whether much consideration has been given to potentially complex,
> indirectly nested union types that might be difficult to resolve, and thus
> add complexity to the implementation of the encoders. Are there use cases
> in practice that involve complex union decisions?
>
> Wai Yip
>
>
