Re: [DISCUSS] Union Vector

2018-02-08 Thread Li Jin
Hi All,

I'd like to bump this thread for more discussion.

There is currently a Java PR to make the current union type match the spec:
https://github.com/apache/arrow/pull/987. Since there is a need for a
"simple union", i.e., a union that can have only one child of each "minor
type" and has a fixed type id for each "minor type" (the current Java
implementation), we cannot just replace the current Java implementation. My
thought is that we can have two Union classes in Java and work out the
spec for the "simple union" type.
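
To make the distinction concrete, here is a minimal Python sketch of the two
behaviours. It is not the Arrow Java API: the class names, the field names,
and the concrete type id values are all hypothetical, and the spec-style
class assumes for simplicity that a child's type id is its position in the
schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical fixed mapping from minor type to type id. The actual ids used
# by the Java implementation are not stated in this thread.
FIXED_TYPE_IDS: Dict[str, int] = {
    "bool": 0, "int8": 1, "int32": 2, "float64": 3, "utf8": 4,
}


@dataclass
class SimpleUnionType:
    """Union with at most one child per minor type and fixed type ids."""
    children: Dict[int, str] = field(default_factory=dict)  # type id -> minor type

    def add_child(self, minor_type: str) -> int:
        type_id = FIXED_TYPE_IDS[minor_type]
        if type_id in self.children:
            raise ValueError(f"simple union already has a {minor_type} child")
        self.children[type_id] = minor_type
        return type_id


@dataclass
class SpecUnionType:
    """Union whose type ids are only meaningful relative to its own schema."""
    children: List[Tuple[str, str]] = field(default_factory=list)  # (name, type)

    def add_child(self, name: str, type_name: str) -> int:
        self.children.append((name, type_name))
        # Simplifying assumption: the type id is the child's position.
        return len(self.children) - 1


simple = SimpleUnionType()
simple_id = simple.add_child("int32")   # always the id fixed for int32

spec = SpecUnionType()
first_id = spec.add_child("a", "int32")
second_id = spec.add_child("b", "int32")  # a second int32 child is allowed
```

A reader of the simple union can interpret a type id without looking at the
schema, while a reader of the spec-style union has to consult the declared
children first.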

What do other people think?


On Thu, Jan 25, 2018 at 12:58 PM, Philipp Moritz wrote:

> Hey Li,
>
> In Ray we need the second type of union, since there can be arbitrary
> nesting.
>
> -- Philipp.
>
> On Thu, Jan 25, 2018 at 8:56 AM, Li Jin wrote:
>
> > Hi All,
> >
> > I'd like to bump this thread to get some more feedback from other people.
> > I think what Wes says makes sense: there seem to be two requirements for
> > union types, and it might make sense to make them different types.
> >
> > I think Dremio has more use cases for the first type of union. I think Ray
> > also has use cases for unions, but I am not sure if they are closer to the
> > first or the second. How do people feel about specifying the details of the
> > first union type?
> >
> > On Thu, Jan 11, 2018 at 2:39 PM, Wes McKinney wrote:
> >
> > > hi all,
> > >
> > > So one of the conflicts that keeps coming up re: unions is the
> > > following two notions:
> > >
> > > * A union as a "variant of primitives" type. Here, values are
> > > constrained to be one of Arrow's primitive types (integer, floating
> > > point, string, boolean, etc.). The value types are statically declared
> > > and thus the union type codes have a fixed interpretation (e.g. 0 is
> > > always boolean, 1 is always int8, and so on).
> > >
> > > * A union as a composition of any child types (including nested
> > > types). In this model, a union internally is like a struct plus type
> > > codes, which refer to a collection of any fields, which may include
> > > other nested types.
> > >
> > > IMHO, these are two different and totally valid things to support. The
> > > former can be viewed as a special case of the latter, but there are
> > > benefits to computation engines to rely on the assumptions of the
> > > former (like the type codes having a static interpretation rather than
> > > a dynamic one).
> > >
> > > Not having the latter union type seems troublesome to me. For example,
> > > other data serialization systems support this:
> > >
> > > * oneof in Protocol Buffers
> > > https://developers.google.com/protocol-buffers/docs/proto#oneof
> > > * union in Flatbuffers
> > > https://google.github.io/flatbuffers/md__schemas.html
> > > * union in Thrift (not documented very well unfortunately)
> > > * union in Avro (I think this is the same)
> > >
> > > Thanks
> > > Wes
> > >
> > > On Thu, Jan 11, 2018 at 11:16 AM, Li Jin wrote:
> > > > Hi All,
> > > >
> > > > Here is a summary of the state and issues of the union vector (to the
> > > > best of my knowledge).
> > > >
> > > > I have summarized some possible solutions based on the discussion so
> > > > far. However, this is not a proposal as there are still a lot of things
> > > > that are not clear at this moment.
> > > >
> > > > I'd like to share this as a base for further discussion and move
> > > > towards a proposal. Thank you.
> > > >
> > > > https://docs.google.com/document/d/1zSwSZDVxgmoDol_PKfyTDHD5wbw1eALs5eTS9kyjtYU/edit?usp=sharing
> > > >
> > > > Li
> > >
> >
>


Re: [DISCUSS] Union Vector

2018-01-25 Thread Philipp Moritz
Hey Li,

In Ray we need the second type of union, since there can be arbitrary
nesting.
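
As a rough illustration of why nesting calls for the second kind of union,
here is a toy Python sketch. It is neither Ray's serialization code nor an
Arrow API: the `DenseUnion` class, its child names, and the `encode`/`decode`
helpers are hypothetical. It keeps one type id and one child offset per slot,
and the "list" child holds further unions, which a union restricted to
primitives could not express.

```python
from typing import Any, List


class DenseUnion:
    """Toy dense union: one type id and one child offset per slot."""

    def __init__(self, child_names: List[str]) -> None:
        self.child_names = child_names
        self.children: List[List[Any]] = [[] for _ in child_names]
        self.type_ids: List[int] = []
        self.offsets: List[int] = []

    def append(self, child: str, value: Any) -> None:
        type_id = self.child_names.index(child)
        self.type_ids.append(type_id)
        # The offset is the value's position inside its own child.
        self.offsets.append(len(self.children[type_id]))
        self.children[type_id].append(value)

    def get(self, i: int) -> Any:
        return self.children[self.type_ids[i]][self.offsets[i]]


def encode(values: List[Any]) -> DenseUnion:
    """Encode a Python list of ints, strings, and nested lists."""
    union = DenseUnion(["int", "str", "list"])
    for v in values:
        if isinstance(v, list):
            union.append("list", encode(v))  # nested child: another union
        elif isinstance(v, str):
            union.append("str", v)
        else:
            union.append("int", v)
    return union


def decode(union: DenseUnion) -> List[Any]:
    """Invert encode(), recursing into the nested "list" child."""
    out: List[Any] = []
    for i, type_id in enumerate(union.type_ids):
        value = union.get(i)
        if union.child_names[type_id] == "list":
            value = decode(value)
        out.append(value)
    return out


data = [1, "a", [2, ["b", 3]], []]
roundtrip = decode(encode(data))
```

The type ids here only make sense together with the list of children declared
for that particular union, which is the dynamic interpretation discussed
earlier in the thread.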

-- Philipp.

On Thu, Jan 25, 2018 at 8:56 AM, Li Jin wrote:

> Hi All,
>
> I'd like to bump this thread to get some more feedback from other people.
> I think what Wes says makes sense: there seem to be two requirements for
> union types, and it might make sense to make them different types.
>
> I think Dremio has more use cases for the first type of union. I think Ray
> also has use cases for unions, but I am not sure if they are closer to the
> first or the second. How do people feel about specifying the details of the
> first union type?
>
> On Thu, Jan 11, 2018 at 2:39 PM, Wes McKinney wrote:
>
> > hi all,
> >
> > So one of the conflicts that keeps coming up re: unions is the
> > following two notions:
> >
> > * A union as a "variant of primitives" type. Here, values are
> > constrained to be one of Arrow's primitive types (integer, floating
> > point, string, boolean, etc.). The value types are statically declared
> > and thus the union type codes have a fixed interpretation (e.g. 0 is
> > always boolean, 1 is always int8, and so on).
> >
> > * A union as a composition of any child types (including nested
> > types). In this model, a union internally is like a struct plus type
> > codes, which refer to a collection of any fields, which may include
> > other nested types.
> >
> > IMHO, these are two different and totally valid things to support. The
> > former can be viewed as a special case of the latter, but there are
> > benefits to computation engines to rely on the assumptions of the
> > former (like the type codes having a static interpretation rather than
> > a dynamic one).
> >
> > Not having the latter union type seems troublesome to me. For example,
> > other data serialization systems support this:
> >
> > * oneof in Protocol Buffers
> > https://developers.google.com/protocol-buffers/docs/proto#oneof
> > * union in Flatbuffers
> > https://google.github.io/flatbuffers/md__schemas.html
> > * union in Thrift (not documented very well unfortunately)
> > * union in Avro (I think this is the same)
> >
> > Thanks
> > Wes
> >
> > On Thu, Jan 11, 2018 at 11:16 AM, Li Jin wrote:
> > > Hi All,
> > >
> > > Here is a summary of the state and issues of the union vector (to the
> > > best of my knowledge).
> > >
> > > I have summarized some possible solutions based on the discussion so
> > > far. However, this is not a proposal as there are still a lot of things
> > > that are not clear at this moment.
> > >
> > > I'd like to share this as a base for further discussion and move
> > > towards a proposal. Thank you.
> > >
> > > https://docs.google.com/document/d/1zSwSZDVxgmoDol_PKfyTDHD5wbw1eALs5eTS9kyjtYU/edit?usp=sharing
> > >
> > > Li
> >
>


Re: [DISCUSS] Union Vector

2018-01-25 Thread Li Jin
Hi All,

I'd like to bump this thread to get some more feedback from other people.
I think what Wes says makes sense: there seem to be two requirements for
union types, and it might make sense to make them different types.

I think Dremio has more use cases for the first type of union. I think Ray
also has use cases for unions, but I am not sure if they are closer to the
first or the second. How do people feel about specifying the details of the
first union type?

On Thu, Jan 11, 2018 at 2:39 PM, Wes McKinney wrote:

> hi all,
>
> So one of the conflicts that keeps coming up re: unions is the
> following two notions:
>
> * A union as a "variant of primitives" type. Here, values are
> constrained to be one of Arrow's primitive types (integer, floating
> point, string, boolean, etc.). The value types are statically declared
> and thus the union type codes have a fixed interpretation (e.g. 0 is
> always boolean, 1 is always int8, and so on).
>
> * A union as a composition of any child types (including nested
> types). In this model, a union internally is like a struct plus type
> codes, which refer to a collection of any fields, which may include
> other nested types.
>
> IMHO, these are two different and totally valid things to support. The
> former can be viewed as a special case of the latter, but there are
> benefits to computation engines to rely on the assumptions of the
> former (like the type codes having a static interpretation rather than
> a dynamic one).
>
> Not having the latter union type seems troublesome to me. For example,
> other data serialization systems support this:
>
> * oneof in Protocol Buffers
> https://developers.google.com/protocol-buffers/docs/proto#oneof
> * union in Flatbuffers
> https://google.github.io/flatbuffers/md__schemas.html
> * union in Thrift (not documented very well unfortunately)
> * union in Avro (I think this is the same)
>
> Thanks
> Wes
>
> On Thu, Jan 11, 2018 at 11:16 AM, Li Jin wrote:
> > Hi All,
> >
> > Here is a summary of the state and issues of the union vector (to the
> > best of my knowledge).
> >
> > I have summarized some possible solutions based on the discussion so far.
> > However, this is not a proposal as there are still a lot of things that
> > are not clear at this moment.
> >
> > I'd like to share this as a base for further discussion and move towards
> > a proposal. Thank you.
> >
> > https://docs.google.com/document/d/1zSwSZDVxgmoDol_PKfyTDHD5wbw1eALs5eTS9kyjtYU/edit?usp=sharing
> >
> > Li
>