[jira] [Commented] (DRILL-3229) Create a new EmbeddedVector
[ https://issues.apache.org/jira/browse/DRILL-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14986926#comment-14986926 ] ASF GitHub Bot commented on DRILL-3229: --- Github user asfgit closed the pull request at: https://github.com/apache/drill/pull/180 > Create a new EmbeddedVector > --- > > Key: DRILL-3229 > URL: https://issues.apache.org/jira/browse/DRILL-3229 > Project: Apache Drill > Issue Type: Sub-task > Components: Execution - Codegen, Execution - Data Types, Execution - > Relational Operators, Functions - Drill >Reporter: Jacques Nadeau >Assignee: Hanifi Gunes > Fix For: Future > > > Embedded Vector will leverage a binary encoding for holding information about > type for each individual field. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-3229) Create a new EmbeddedVector
[ https://issues.apache.org/jira/browse/DRILL-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14986533#comment-14986533 ] ASF GitHub Bot commented on DRILL-3229: --- Github user jacques-n commented on the pull request: https://github.com/apache/drill/pull/180#issuecomment-153219549 This seems very useful to users as an experimental feature. +1 with the default behavior as disabled.
[jira] [Commented] (DRILL-3229) Create a new EmbeddedVector
[ https://issues.apache.org/jira/browse/DRILL-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974674#comment-14974674 ] Parth Chandra commented on DRILL-3229: -- Just realised that with Untyped nulls, we would need to resolve the question of how we will handle schema only queries. We can't be sending back a schema with no type.
[jira] [Commented] (DRILL-3229) Create a new EmbeddedVector
[ https://issues.apache.org/jira/browse/DRILL-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14972246#comment-14972246 ] Parth Chandra commented on DRILL-3229: -- ... get a headstart on fixing the C++ client. (sorry for the break in transmission).
[jira] [Commented] (DRILL-3229) Create a new EmbeddedVector
[ https://issues.apache.org/jira/browse/DRILL-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14972238#comment-14972238 ] Parth Chandra commented on DRILL-3229: -- I think we do need to do the same thing as with complex types: see whether the client supports complex types, and if not, convert to JSON. Otherwise we will break the C++ client (fixing the C++ client for this would be painful, and useless since no consumer of the API can currently handle the Union type). Untyped nulls will also break the C++ client, though it is easier to fix that, so we probably should. If you include a quick proposal on that, I can get a headstart on the
[jira] [Commented] (DRILL-3229) Create a new EmbeddedVector
[ https://issues.apache.org/jira/browse/DRILL-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971766#comment-14971766 ] Steven Phillips commented on DRILL-3229:

Regarding the list writer, I know it is a bit confusing, so I will try to give a better explanation of how it works. It confuses me at times as well.

The type promotion was designed with the possibility of allowing other promotions in mind, but I am currently only doing promotion to Union. We should have a discussion about what other promotions we want to allow.

Screen currently returns a Union type to the user. This is an area that will require additional enhancement. The DrillClient has no problem dealing with a Union vector. The JDBC driver, on the other hand, currently has only limited support for a Union type. I think we might need to add a feature similar to what we have with complex types, which will determine whether the client is able to handle Union types, and convert to JSON if it is not. So metadata queries will also return a Union type.

As for case statements, I am leaning more toward a general philosophy of trying as much as we can to not fail queries, and so if there is something Drill can do to execute a query, it should do that. So I am leaning toward option 3.

An untyped-null type is supported as part of a Union vector. This null value is encoded in the 'type' vector. This patch does not introduce a standalone Untyped Null Vector. That will be a separate patch.

I will update the design document with what I have said here.
[jira] [Commented] (DRILL-3229) Create a new EmbeddedVector
[ https://issues.apache.org/jira/browse/DRILL-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970439#comment-14970439 ] Parth Chandra commented on DRILL-3229: --

Nice doc. I have a couple of quick questions -

In the list writer, when the map() method is called, I didn't quite follow the reason for tracking the current field name. What is it needed for?

The Type promotion proposal is excellent. But with type promotion we will update the underlying writer to a UnionWriter the moment a type change occurs. Is it possible for us to have a hierarchy of promotable types, so that we promote to a higher Scalar type (e.g. Int gets promoted to a Varchar) as a first step, and to Union if we encounter more than one type change or a change to a complex type? I'm OK if we think this is too complex to implement.

How will Screen handle a Union type? In general, a user level tool (sqlline included) will not know how to handle this. Can we have Screen return a varchar representation of the Union type? During data exploration the user will then see there are type changes and can then use the type introspection and cast methods appropriately.

What about metadata only queries (i.e. select * ... limit 0)? What type would the user application get?

For Function Evaluation my preference is to have code generation rather than have UDFs that take a union parameter.

For case statements - If a case statement can output a Union type, the end user will presumably have to resolve the different types using type introspection and an outer case statement. Actually I don't know enough about end user use cases to choose which is more desirable. Should we leave it as choice #2 and see what users ask for?

Jacques had mentioned that you have an idea for introducing an Untyped null type. How would that fit in with this design?
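To make the promotion question above concrete, here is a rough sketch of one way such a hierarchy could behave. It is an illustration only, not code from the patch or the design doc: the class PromotingColumn, the Kind enum and the rule "first scalar widening keeps a scalar type, anything after that becomes UNION" are all assumptions made up for this example.

```java
// Hypothetical sketch of a promotion hierarchy; none of these names exist in Drill.
class PromotingColumn {
    // Assumed type set. The scalars are declared from narrowest to widest.
    enum Kind { INT, BIGINT, VARCHAR, MAP, UNION }

    private Kind kind;
    private int typeChanges = 0;

    PromotingColumn(Kind initial) {
        this.kind = initial;
    }

    private static boolean isScalar(Kind k) {
        return k.ordinal() <= Kind.VARCHAR.ordinal();
    }

    // Called for every incoming value; returns the column type after the value is seen.
    Kind observe(Kind incoming) {
        if (kind == Kind.UNION) {
            return kind; // already the most general type
        }
        boolean bothScalar = isScalar(kind) && isScalar(incoming);
        // A value of the same or a narrower scalar type fits without any change.
        if (incoming == kind || (bothScalar && incoming.ordinal() < kind.ordinal())) {
            return kind;
        }
        typeChanges++;
        // First change between scalars widens the scalar; everything else goes to UNION.
        kind = (bothScalar && typeChanges == 1) ? incoming : Kind.UNION;
        return kind;
    }

    public static void main(String[] args) {
        PromotingColumn col = new PromotingColumn(Kind.INT);
        System.out.println(col.observe(Kind.VARCHAR));
        System.out.println(col.observe(Kind.INT));
        System.out.println(col.observe(Kind.MAP));
    }
}
```

Under these assumptions a column only pays for a Union once a second type change or a complex type shows up, which is the trade-off the question is asking about.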
> Create a new EmbeddedVector > --- > > Key: DRILL-3229 > URL: https://issues.apache.org/jira/browse/DRILL-3229 > Project: Apache Drill > Issue Type: Sub-task > Components: Execution - Codegen, Execution - Data Types, Execution - > Relational Operators, Functions - Drill >Reporter: Jacques Nadeau >Assignee: Steven Phillips > Fix For: Future > > > Embedded Vector will leverage a binary encoding for holding information about > type for each individual field. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-3229) Create a new EmbeddedVector
[ https://issues.apache.org/jira/browse/DRILL-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14969613#comment-14969613 ] Steven Phillips commented on DRILL-3229: Design document: https://gist.github.com/StevenMPhillips/41b4a1bd745943d508d2
[jira] [Commented] (DRILL-3229) Create a new EmbeddedVector
[ https://issues.apache.org/jira/browse/DRILL-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939656#comment-14939656 ] ASF GitHub Bot commented on DRILL-3229: --- GitHub user StevenMPhillips opened a pull request: https://github.com/apache/drill/pull/180 DRILL-3229: Implement Union type vector You can merge this pull request into a Git repository by running: $ git pull https://github.com/StevenMPhillips/drill drill-3229 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/180.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #180 commit e5a529d7a276597f2b62cdcb9a1cab2fae8bc52f Author: Steven Phillips Date: 2015-10-01T10:26:34Z DRILL-3229: Implement Union type vector
[jira] [Commented] (DRILL-3229) Create a new EmbeddedVector
[ https://issues.apache.org/jira/browse/DRILL-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939640#comment-14939640 ] Steven Phillips commented on DRILL-3229:

i) In this first iteration, Union types will be enabled with an option, and they will be created in the JSON reader and the Mongo reader automatically if the option is enabled. Everything will be a Union type in this case. A future patch will work on promoting from non-union once it is necessary to promote.

ii) Your understanding is correct. One change from the earlier comment: there is no "bits" vector. The underlying primitive type vectors will have their own "bits" for tracking nulls. A value of zero in the type vector will also indicate null.

Without going into much detail at this point, I can answer the next paragraph of questions by saying that this patch will allow reading of any valid JSON. It also has a more literal representation of the JSON, e.g. null values will be treated as null, instead of empty maps/lists. The patch also includes functions for inspecting the type of a field, which can be used with case statements to handle the data based on which type it is.

Though it may be somewhat cumbersome, with these tools you should be able to run almost any query against dynamic JSON data. This will generally involve using introspection and case statements to remove the Union types early in the query. Future work will eliminate the need for this in many cases. One notable exception is that flatten is not supported in this initial patch.
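As a rough picture of the layout described in this comment, the following is a toy model, not Drill code. The class name ToyUnionVector, the array-backed storage and the numeric type codes are assumptions made for illustration; the only rule taken from the comment is that a value of zero in the type vector means null, so no separate "bits" vector is needed at the union level.

```java
// Toy model of a union column: one type code per row plus one storage area per type.
class ToyUnionVector {
    // Hypothetical type codes; only "0 means null" comes from the comment above.
    static final byte NULL = 0;
    static final byte BIGINT = 1;
    static final byte VARCHAR = 2;

    private final byte[] types;      // per-row type code, zero-initialised so every row starts as null
    private final long[] bigInts;    // stands in for the BIGINT child vector
    private final String[] varChars; // stands in for the VARCHAR child vector

    ToyUnionVector(int capacity) {
        types = new byte[capacity];
        bigInts = new long[capacity];
        varChars = new String[capacity];
    }

    void writeBigInt(int idx, long value) {
        types[idx] = BIGINT;
        bigInts[idx] = value;
    }

    void writeVarChar(int idx, String value) {
        types[idx] = VARCHAR;
        varChars[idx] = value;
    }

    boolean isNull(int idx) {
        return types[idx] == NULL;
    }

    // Type introspection: the kind of check a case statement could branch on.
    String typeOf(int idx) {
        switch (types[idx]) {
            case BIGINT: return "BIGINT";
            case VARCHAR: return "VARCHAR";
            default: return "NULL";
        }
    }

    Object get(int idx) {
        switch (types[idx]) {
            case BIGINT: return bigInts[idx];
            case VARCHAR: return varChars[idx];
            default: return null;
        }
    }

    public static void main(String[] args) {
        ToyUnionVector v = new ToyUnionVector(3);
        v.writeBigInt(0, 42L);
        v.writeVarChar(1, "forty-two");
        // Row 2 is never written, so its type code stays 0 and it reads as null.
        for (int i = 0; i < 3; i++) {
            System.out.println(i + ": " + v.typeOf(i) + " -> " + v.get(i));
        }
    }
}
```

A query that wants a single type out of such a column would inspect typeOf per row and convert or discard values accordingly, which is the "introspection and case statements" pattern mentioned above.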
[jira] [Commented] (DRILL-3229) Create a new EmbeddedVector
[ https://issues.apache.org/jira/browse/DRILL-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907299#comment-14907299 ] Parth Chandra commented on DRILL-3229: --

The union type looks good (haven't delved into the UnionListVector, though it doesn't look too far removed from the UnionVector). I'm missing some details -

i) When do we create a Union type?

ii) The Union Vector will have a map vector which will have a field for each minor type. The fields will be nullable vectors of the corresponding minor type. For a given value, only one of the value vectors will have the bits field set. Is my understanding correct? A picture would be a big help.

More importantly, can we write up a couple of notes on the big picture so I can see where this fits in? For instance, it is not clear in what cases we plan to use this. There are different use cases where changing schema is encountered. For instance, a large number of nulls followed by a schema that materializes is one frequently encountered case. The other common case is that of a primitive type that appears within quotes in a particular record and gets interpreted as a varchar. More complex cases can occur that have the same information represented differently, e.g. a timestamp that is written either as a string or as a long. (I'm not yet considering the rather extreme example in the yelp data set where a null field shows up as an empty map). Which of these types of cases are we addressing with UnionVectors?

Also, one question I've never resolved in my own mind is that of FieldMetadata. Does a ValueVector require FieldMetadata to describe its structure? Or is it the other way around: FieldMetadata can be derived from the ValueVector. Either way, how do we define FieldMetadata for Union types? What is the impact on ODBC/JDBC, if any?

Would a shared doc be a better way to discuss this? Then we can consolidate and add the result to https://drill.apache.org/docs/value-vectors/.
[jira] [Commented] (DRILL-3229) Create a new EmbeddedVector
[ https://issues.apache.org/jira/browse/DRILL-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14875831#comment-14875831 ] Jacques Nadeau commented on DRILL-3229: --- [~hgunes] and [~parthc], it would be good to get your feedback on this design. I'll post my notes shortly
[jira] [Commented] (DRILL-3229) Create a new EmbeddedVector
[ https://issues.apache.org/jira/browse/DRILL-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14804637#comment-14804637 ] Steven Phillips commented on DRILL-3229:

Basic design outline:

A Union type represents a field where the type can vary between records. The data for a field of type Union will be stored in a UnionVector.

h4. UnionVector

Internally uses a MapVector to hold the vectors for the various types. The types include all of the MinorTypes, including List and Map. For example, the internal MapVector will have a subfield named "bigInt", which will refer to a NullableBigIntVector.

In addition to the vectors corresponding to the minor types, there will be two additional fields, both represented by UInt1Vectors. These are "bits" and "types", which will represent the nullability and types of the underlying data. The "bits" vector will work the same way it works in other nullable vectors. The "types" vector will store the number corresponding to the value of the MinorType as defined in the protobuf definition. There will be mutator methods for setting null and type.

h4. UnionWriter

The UnionWriter implements and overrides all of the methods of FieldWriter. It holds field writers corresponding to each of the types included in the underlying UnionVector, and delegates the method calls for each type to the corresponding writer.
For example, the BigIntWriter interface:

{code}
public interface BigIntWriter extends BaseWriter {
  public void write(BigIntHolder h);
  public void writeBigInt(long value);
}
{code}

UnionWriter overrides these methods:

{code}
@Override
public void writeBigInt(long value) {
  // Record the type and non-null state before delegating to the typed writer.
  data.getMutator().setType(idx(), MinorType.BIGINT);
  data.getMutator().setNotNull(idx());
  getBigIntWriter().setPosition(idx());
  getBigIntWriter().writeBigInt(value);
}

@Override
public void write(BigIntHolder h) {
  data.getMutator().setType(idx(), MinorType.BIGINT);
  data.getMutator().setNotNull(idx());
  getBigIntWriter().setPosition(idx());
  getBigIntWriter().writeBigInt(h.value);
}
{code}

This requires users of the interface to go through the UnionWriter, rather than using the underlying BigIntWriter directly. Otherwise, the "types" and "bits" vectors would not get set correctly.

h4. UnionReader

Much the same as the UnionWriter, the UnionReader overrides the methods of FieldReader, and delegates to a corresponding specific FieldReader implementation depending on which type the current value is.

h4. UnionListVector

UnionListVector extends BaseRepeatedVector. It works much the same as other Repeated vectors; there is a data vector and an offset vector. The data vector in this case is a UnionVector.

h4. UnionListWriter

The UnionListWriter overrides all FieldWriter methods. When starting a new list, the startList() method is called. This calls the startNewValue(int index) method of the underlying UnionListVector.Mutator. Subsequent calls to the ListWriter methods (such as bigInt()) return the UnionListWriter itself, and calls to write are handled by calling the appropriate method on the underlying UnionListVector.Mutator, which handles updating the offset vector.

In the case that the map() method is called (i.e. repeated map), the UnionListWriter is itself returned, but a state variable is updated to indicate that it should operate as a MapWriter.
While in MapWriter mode, calls to the MapWriter methods will also return the UnionListWriter itself, but will also update the field indicating what the name of the current field is. Subsequent writes to the ScalarWriter methods will write to the underlying UnionVector using the UnionWriter interface.

For example,

{code}
UnionListWriter list;
...
list.startList();
list.map().bigInt("a").writeBigInt(1);
{code}

This code first indicates that a new list is starting. By doing this, the offset vector is correctly set. Calling map() sets the internal state of the writer to "MAP". bigInt("a") sets the current field of the writer to "a", and writeBigInt(1) writes the value 1 to the underlying UnionVector.

Another example:

{code}
MapWriter mapWriter = list.map().map("a");
{code}

In this case, the final call to map("a") delegates to the underlying UnionWriter, and returns a new MapWriter, with the position set according to the current offset.
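The interaction between startList(), the writes and the offset vector described above can be pictured with a small stand-alone model. This is a sketch under assumptions, not the UnionListVector implementation: ToyUnionList and its methods are hypothetical names, java.util lists stand in for the offset and data vectors, and plain Objects stand in for union values.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a repeated (list) vector: a flat data vector plus an offset vector.
class ToyUnionList {
    // offsets.get(i) is where list i starts in data; offsets.get(i + 1) is where it ends.
    private final List<Integer> offsets = new ArrayList<>();
    // Flat storage for the values of all lists; a UnionVector plays this role in the design.
    private final List<Object> data = new ArrayList<>();

    ToyUnionList() {
        offsets.add(0);
    }

    // Begin a new, initially empty list at the current end of the data vector.
    void startList() {
        offsets.add(data.size());
    }

    // Append a value of any type to the current list and move its end offset forward.
    void write(Object value) {
        data.add(value);
        offsets.set(offsets.size() - 1, data.size());
    }

    int listCount() {
        return offsets.size() - 1;
    }

    List<Object> getList(int i) {
        return new ArrayList<>(data.subList(offsets.get(i), offsets.get(i + 1)));
    }

    public static void main(String[] args) {
        ToyUnionList list = new ToyUnionList();
        list.startList();
        list.write(1L);
        list.write("a");
        list.startList(); // an empty list
        list.startList();
        list.write(2.5);
        for (int i = 0; i < list.listCount(); i++) {
            System.out.println(i + ": " + list.getList(i));
        }
    }
}
```

Because every write has to move the end offset, writes must go through the list-level writer rather than straight to the data vector, which mirrors the reason given above for routing writes through the UnionWriter.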