For what it's worth, I've taken to defining "unions" in the Thrift or Protocol Buffer style, for reasons including the tricky handling in dynamic languages. Instead of "union(A, B, C, D)", I do "struct(union(null, A) a, union(null, B) b, union(null, C) c, union(null, D) d)". Note that this implies certain storage inefficiencies. I'm doing this in RPC-land, where the extra few bytes aren't bothering me.
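
A minimal sketch of that layout, built as a plain Python dict rather than with any Avro library. The record name "Wrapper", the lowercase field names, and the helper functions are hypothetical; A, B, C and D are assumed to be named types defined elsewhere in the schema.

```python
import json

# Placeholder branch names taken from the message; the real types are not shown.
BRANCHES = ["A", "B", "C", "D"]

def wrapper_schema(name, branches):
    """Build an Avro record schema with one nullable field per branch."""
    return {
        "type": "record",
        "name": name,
        "fields": [
            # "null" comes first so that a null default is valid.
            {"name": b.lower(), "type": ["null", b], "default": None}
            for b in branches
        ],
    }

def which_branch(datum):
    """Return the name of the single non-null field, or raise."""
    set_fields = [k for k, v in datum.items() if v is not None]
    if len(set_fields) != 1:
        raise ValueError("expected exactly one non-null field, got %r" % set_fields)
    return set_fields[0]

schema = wrapper_schema("Wrapper", BRANCHES)
print(json.dumps(schema, indent=2))
print(which_branch({"a": None, "b": {"x": 1}, "c": None, "d": None}))
```

The field name now identifies the branch, so a dynamic-language encoder never has to guess. The cost is that each unset field still writes its own union index, which in Avro's binary encoding should be one byte per null branch.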
-- Philip

On Thu, Jun 5, 2014 at 11:00 AM, Grant Overby (groverby) <[email protected]> wrote:

> Sure, but that is kind of an unbounded question. Can you be more
> specific as to what you’re looking for?
>
> Here’s a shot at an answer:
> Polymorphism is a weak spot for Avro; unions help get around that
> shortcoming. We have unions which contain multiple record
> specifications. A field whose schema is a union datatype could point
> to an instance of one of many classes, with the actual class known
> only at runtime.
>
> *Grant Overby*
> Software Engineer
> Cisco.com
> [email protected]
> Mobile: *865 724 4910*
>
> Think before you print.
>
> This email may contain confidential and privileged material for the sole
> use of the intended recipient. Any review, use, distribution or disclosure
> by others is strictly prohibited. If you are not the intended recipient (or
> authorized to receive for the recipient), please contact the sender by
> reply email and delete all copies of this message.
>
> Please click here
> <http://www.cisco.com/web/about/doing_business/legal/cri/index.html> for
> Company Registration Information.
>
> From: Wai Yip Tung <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Thursday, June 5, 2014 at 1:40 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: Union resolution in dynamic languages
>
> That's good to know. Would you mind sharing your use case with us?
>
> Wai Yip
>
> Grant Overby (groverby) <[email protected]>
> Thursday, June 05, 2014 6:46 AM
> Disallowing multiple named types within a union would break our use
> cases.
>
> We have a similar problem. With two record types in a union, the Python
> driver doesn’t choose well.
>
> We solved this problem by adding a pseudo-reserved key to the dict to
> indicate which named type to use. I started the process of open sourcing
> that patch a few days ago. It’s definitely a hack, but I’m hoping the
> community will accept it.
>
> Our patch doesn’t change the time complexity. From a brief glance,
> choosing within the union seems to typically be O(n) as the recursion
> short circuits. For named types, the complexity could be O(1). Achieving
> O(1) for non-named types seems achievable too. How many projects are
> impacted by this ‘wasted’ complexity? Simpler code might be better than
> faster code.
>
> From: Wai Yip Tung <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Wednesday, June 4, 2014 at 9:34 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: Union resolution in dynamic languages
>
> Also, I ask about this in the context of building an optimized encoder.
> For this implementation, the resolution would be much simpler if we
> limited unions to not support two records, similar to how the spec does
> not allow two array or two map types. I wonder if this limit breaks any
> significant use case.
>
> Wai Yip
>
> Wai Yip Tung <[email protected]>
> Wednesday, June 04, 2014 4:40 PM
> For encoding data of a union type, the Avro specification does not say
> much about how to decide which of the types in the union is used. So far
> I am mostly using unions so that I can write null or another simple
> type. In these cases, it is fairly obvious how the encoder can
> distinguish null from the other types.
>
> However, a union can also contain any named types, so its branches can
> be two records, let's say a Manager record and a NonManager record. I
> think with strongly typed languages, the suitable type in the union can
> be selected by introspection. But in dynamic languages, these might
> just be represented as maps without any notion of type. In some cases,
> we may find that the object has all the attributes of a NonManager but
> not those of a Manager, so we can conclude NonManager is the proper
> schema to use. But this can get complicated with nested data structures
> where the attribute that can disambiguate things appears at a deeper
> level. Or you can think of valid scenarios where inspecting the content
> of the object cannot unambiguously resolve the union branch.
>
> I notice that the Python implementation uses two-pass recursive
> validation, possibly for the purpose of resolving the union choice.
>
> I wonder whether much consideration has been given to potentially
> complex, indirectly nested union types that might be difficult to
> resolve, and that thus add complexity to the implementation of the
> encoders. Are there use cases in practice that involve complex union
> decisions?
>
> Wai Yip
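
The two resolution strategies discussed in the thread can be sketched roughly as follows. This is only an illustration in plain Python: it is neither the Avro Python library's validation code nor the patch mentioned above. The Manager and NonManager fields, the key name `__type__`, and all function names are assumptions, and the shape check is deliberately shallow.

```python
# Hypothetical schemas for the Manager / NonManager example.
MANAGER = {"type": "record", "name": "Manager",
           "fields": [{"name": "name", "type": "string"},
                      {"name": "reports", "type": "int"}]}
NON_MANAGER = {"type": "record", "name": "NonManager",
               "fields": [{"name": "name", "type": "string"},
                          {"name": "desk", "type": "string"}]}
UNION = [MANAGER, NON_MANAGER]

TYPE_KEY = "__type__"  # assumed name for the pseudo-reserved key

def matches(schema, datum):
    """Shallow check: the dict has exactly the record's field names."""
    if not isinstance(datum, dict):
        return False
    expected = {f["name"] for f in schema["fields"]}
    return set(datum) - {TYPE_KEY} == expected

def resolve_by_shape(union, datum):
    """O(n) scan: index of the first branch whose field names match."""
    for i, schema in enumerate(union):
        if matches(schema, datum):
            return i
    raise ValueError("no branch of the union matches the datum")

# Built once per union so that tagged lookups are a single dict access.
NAME_INDEX = {s["name"]: i for i, s in enumerate(UNION)}

def resolve_by_tag(index, datum):
    """O(1) lookup for named types, using the tag key in the dict."""
    return index[datum[TYPE_KEY]]

print(resolve_by_shape(UNION, {"name": "Ann", "desk": "4B"}))
print(resolve_by_tag(NAME_INDEX, {TYPE_KEY: "Manager", "name": "Bob", "reports": 3}))
```

The shape-based scan picks the first matching branch, so two records with identical field names cannot be told apart. The tag lookup avoids that ambiguity, at the cost of reserving a key in the data.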
