[GitHub] drill issue #1016: DRILL-5913:solve the mixed processing of same functions w...
Github user weijietong commented on the issue: https://github.com/apache/drill/pull/1016 @amansinha100 you may be familiar with this part of the code. Could you review it? Reviews from anyone else are also welcome. ---
[GitHub] drill issue #1027: DRILL-4779 : Kafka storage plugin
Github user kameshb commented on the issue: https://github.com/apache/drill/pull/1027 @paul-rogers @arina-ielchiieva @vrozov Thanks for reviewing. Anil & I have addressed the review comments. Could you please go through the changes and also the rest of the Kafka storage codebase? ---
[GitHub] drill pull request #1027: DRILL-4779 : Kafka storage plugin
Github user kameshb commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1027#discussion_r150435308

    --- Diff: contrib/storage-kafka/src/main/java/org/apache/drill/exec/store/kafka/KafkaRecordReader.java ---
    @@ -0,0 +1,178 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.store.kafka;
    +
    +import static org.apache.drill.exec.store.kafka.DrillKafkaConfig.DRILL_KAFKA_POLL_TIMEOUT;
    +
    +import java.util.Collection;
    +import java.util.Iterator;
    +import java.util.List;
    +import java.util.Set;
    +import java.util.concurrent.TimeUnit;
    +
    +import org.apache.drill.common.exceptions.ExecutionSetupException;
    +import org.apache.drill.common.expression.SchemaPath;
    +import org.apache.drill.exec.ExecConstants;
    +import org.apache.drill.exec.ops.FragmentContext;
    +import org.apache.drill.exec.ops.OperatorContext;
    +import org.apache.drill.exec.physical.impl.OutputMutator;
    +import org.apache.drill.exec.store.AbstractRecordReader;
    +import org.apache.drill.exec.store.kafka.KafkaSubScan.KafkaSubScanSpec;
    +import org.apache.drill.exec.store.kafka.decoders.MessageReader;
    +import org.apache.drill.exec.store.kafka.decoders.MessageReaderFactory;
    +import org.apache.drill.exec.util.Utilities;
    +import org.apache.drill.exec.vector.complex.impl.VectorContainerWriter;
    +import org.apache.kafka.clients.consumer.ConsumerRecord;
    +import org.apache.kafka.clients.consumer.ConsumerRecords;
    +import org.apache.kafka.clients.consumer.KafkaConsumer;
    +import org.apache.kafka.common.TopicPartition;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +import com.google.common.base.Stopwatch;
    +import com.google.common.collect.Lists;
    +import com.google.common.collect.Sets;
    +public class KafkaRecordReader extends AbstractRecordReader {
    +  private static final Logger logger = LoggerFactory.getLogger(KafkaRecordReader.class);
    +  public static final long DEFAULT_MESSAGES_PER_BATCH = 4000;
    +
    +  private VectorContainerWriter writer;
    +  private MessageReader messageReader;
    +
    +  private boolean unionEnabled;
    +  private KafkaConsumer<byte[], byte[]> kafkaConsumer;
    +  private KafkaStoragePlugin plugin;
    +  private KafkaSubScanSpec subScanSpec;
    +  private long kafkaPollTimeOut;
    +  private long endOffset;
    +
    +  private long currentOffset;
    +  private long totalFetchTime = 0;
    +
    +  private List<TopicPartition> partitions;
    +  private final boolean enableAllTextMode;
    +  private final boolean readNumbersAsDouble;
    +
    +  private Iterator<ConsumerRecord<byte[], byte[]>> messageIter;
    +
    +  public KafkaRecordReader(KafkaSubScan.KafkaSubScanSpec subScanSpec, List<SchemaPath> projectedColumns,
    +      FragmentContext context, KafkaStoragePlugin plugin) {
    +    setColumns(projectedColumns);
    +    this.enableAllTextMode = context.getOptions().getOption(ExecConstants.KAFKA_ALL_TEXT_MODE).bool_val;
    +    this.readNumbersAsDouble = context.getOptions()
    +        .getOption(ExecConstants.KAFKA_READER_READ_NUMBERS_AS_DOUBLE).bool_val;
    +    this.unionEnabled = context.getOptions().getOption(ExecConstants.ENABLE_UNION_TYPE);
    +    this.plugin = plugin;
    +    this.subScanSpec = subScanSpec;
    +    this.endOffset = subScanSpec.getEndOffset();
    +    this.kafkaPollTimeOut = Long.valueOf(plugin.getConfig().getDrillKafkaProps().getProperty(DRILL_KAFKA_POLL_TIMEOUT));
    +  }
    +
    +  @Override
    +  protected Collection<SchemaPath> transformColumns(Collection<SchemaPath> projectedColumns) {
    +    Set<SchemaPath> transformed = Sets.newLinkedHashSet();
    +    if (!isStarQuery()) {
    +      for (SchemaPath column : projectedColumns) {
    +        transformed.add(column);
    +      }
    +    } else {
    +      transformed.add(Utilities.STAR_COLUMN);
    +    }
    +    return transformed;
    +  }
    +
    +  @Override
    +  public void setup(OperatorContext context, OutputMutator output) throws ExecutionSetupException {
    +    this.writer = new VectorContainerWriter(output, unionEnabled);
[jira] [Created] (DRILL-5958) Revisit the List and RepeatedList vectors
Paul Rogers created DRILL-5958:
----------------------------------

Summary: Revisit the List and RepeatedList vectors
Key: DRILL-5958
URL: https://issues.apache.org/jira/browse/DRILL-5958
Project: Apache Drill
Issue Type: Improvement
Affects Versions: 1.11.0
Reporter: Paul Rogers

Drill provides a List vector used when reading JSON data. The semantics of this vector are somewhat obscure and overly complex. This ticket asks to clean up the design and implementation of this vector.

h4. Current Behavior

Drill contains two kinds of repeated types:

* Repeated vectors, which exist for all Drill types.
* List vectors, which exist outside the repeated vector system.

Lists are rather hard to explain. Drill has 38 types. Each type comes in three cardinalities: Required (1), Optional (0, 1) or Repeated (0..n). Thus, there is an {{IntVector}}, a {{NullableIntVector}} and a {{RepeatedIntVector}}.

Lists are an odd duck and exist outside of this system. A list is not simply another level of repetition (a {{RepeatedRepeatedIntVector}}). Rather, a list is heterogeneous: it is just a list of something.

For this reason, the List type is closely associated with the Union type: a list is, essentially, a "repeated Union", though it is not implemented that way.

Strangely, Drill does have a {{RepeatedListVector}}, which introduces all manner of ambiguity. Combining these, the cardinality hierarchy for unions is:

* {{UnionVector}} (like an optional union type)
* {{ListVector}} (repeated union)
* {{RepeatedListVector}} (a 2D union array)
* {{RepeatedListVector}} which contains a {{ListVector}} (a 3D union grid. Note that this could also be implemented as a {{ListVector}} that contains a {{RepeatedListVector}}.)
* {{RepeatedListVector}} which contains a {{RepeatedListVector}} (a 4D hyper grid.)
* And so on.

For a primitive type, such as Int, we have:

* {{IntVector}} or {{NullableIntVector}} (cardinality of 1 or (0,1))
* {{RepeatedIntVector}} (a 1D list of Int)
* {{ListVector}} which contains a {{RepeatedIntVector}} (a 2D array of ints. Note that this could have been a {{RepeatedListVector}} that stores only ints.)
* {{RepeatedListVector}} which contains a {{RepeatedIntVector}} (a 3D cube of ints. This could also be formed by a {{ListVector}} that contains a {{ListVector}} that contains a {{RepeatedIntVector}}, along with several other combinations.)

h4. Examples of Current Behavior

Lists and repeated types appear to have evolved to support JSON-like structures. For example:

{code}
{a: 10} {a: null}
{code}

Here, `a` is a nullable scalar and is represented as a {{NullableIntVector}}.

{code}
{a: [10, 20]}
{code}

Here, `a` is a list of Int and is represented as a {{RepeatedIntVector}}. Drill does not allow nulls in such vectors, so we cannot represent:

{code}
{a: [10, null, 20]}
{code}

Once we go beyond 1D, we need lists:

{code}
{a: [[10, 20], [30, 40]]}
{code}

The above requires a {{ListVector}} that contains a {{RepeatedIntVector}}.

{code}
{a: [[[110, 120], [130, 140]], [[210, 220], [230, 240]]]}
{code}

The above requires a {{RepeatedListVector}} that contains a {{RepeatedIntVector}}.

Similarly, since lists can hold any type (just like a union), we can have repeated objects:

{code}
{a: [[{x: 0, y: 0}, {x: 1, y: 0}], [{x: 4, y: 0}, {x: 4, y: 1}]]}
{code}

The above would be represented as a {{ListVector}} that contains a {{RepeatedMapVector}}. (Or, equivalently, a {{RepeatedListVector}} that contains a {{MapVector}}.)

Because the List vector is a union type, it can (presumably) also handle heterogeneous lists (though the code needs to be checked to see whether it actually supports this case):

{code}
{a: [10, "fred", 123.45, null]}
{code}

Since unions support combinations of not just scalars, but also scalars and complex types, Drill can also support:

{code}
{a: [10, {b: "foo"}, null, [10, "bob"]]}
{code}

h4. Muddy Semantics

The above shows a number of problems that make lists (and unions) far more complex than necessary:

* Ambiguity of when to use a {{ListVector}} of {{FooVector}} vs. a {{RepeatedFooVector}}.
* Ambiguity of when to use a {{ListVector}} of {{RepeatedFooVector}} vs. a {{RepeatedListVector}} of {{FooVector}}.

The same solution used to handle extra layers of repetition is used to handle variant types (DRILL-5955):

* Lists can handle any combination of scalars.
* Lists can handle any structure type (map, repeated map, list, repeated list).
* Lists are thus not typed. They are not a "List of Int", they are just a List.

h4. Mapping to SQL

The above muddy semantics give rise to this question: Drill is a SQL engine, so how do we map the List semantics to a relational schema? If we don't have a clean answer, then the List type, while clever, does not have a useful purpose and is instead distracting us from the larger question of how we map JSON-like structures to a relational schema.
[jira] [Created] (DRILL-5957) Wire protocol versioning, version negotiation
Paul Rogers created DRILL-5957:
----------------------------------

Summary: Wire protocol versioning, version negotiation
Key: DRILL-5957
URL: https://issues.apache.org/jira/browse/DRILL-5957
Project: Apache Drill
Issue Type: Improvement
Affects Versions: 1.11.0
Reporter: Paul Rogers

Drill has very limited support for evolving its wire protocol. As Drill becomes more widely deployed, this limitation will constrain the project's ability to rapidly evolve the wire protocol based on user experience to improve simplicity and performance or to minimize resource use.

Proposed is a standard mechanism to version the API and negotiate the API version between client and server at connect time. The focus here is between Drill clients (JDBC, ODBC) and the Drill server. The same mechanism can also be used between servers to support rolling upgrades.

This proposal is an outline; it is not a detailed design. The purpose here is to drive understanding of the problem. Once we have that, we can focus on the implementation details.

h4. Problem Statement

The problem we wish to address here concerns both the _syntax_ and _semantics_ of API messages. Syntax concerns:

* The set of messages and their sequence
* The format of bytes on the wire
* The format of message packets

Semantics concerns:

* The meaning of each field.
* The layout of non-message data (vectors, in Drill.)

We wish to introduce a system whereby both syntax and semantics can be evolved in a controlled, known manner such that:

* A client of version x can connect to, and interoperate with, a server in a range of versions (x-y, x+z) for some values of y and z.

For example, version x of the Drill client is deployed in the field. It must connect to the oldest Drill cluster available to that client. (That is, it must connect to servers up to y versions old.) During an upgrade, the server may be upgraded before the client. Thus, the client must also work with servers up to z versions newer than the client.

If we wish to tackle rolling upgrades, then y and z can both be 1 for server-to-server APIs. A version x server will talk with (x-1) servers when the cluster upgrades to x, and will talk to (x+1) servers when the cluster is upgraded to version (x+1).

h4. Current State

Drill currently provides some ad-hoc version compatibility:

* Slow change. Drill's APIs have not changed much since Drill 1.0, thereby avoiding the issue.
* Protobuf support. Drill uses Protobuf for message bodies, leveraging that format's ability to absorb the addition or deprecation of individual fields.
* API version number. The API holds a version number, though the code to use it is rather ad-hoc.

The above has allowed clever coding to handle some version changes, but each is a one-off, ad-hoc solution. The recent security work is an example that, with enough effort, ad-hoc solutions can be found.

The above cannot handle:

* Change in the message order
* Change in the "pbody/dbody" structure of each message.
* Change in the structure of serialized value vectors.

As a result, the current structure prevents any change to Drill's core mechanism, value vectors, as there is no way for clients and servers to negotiate the vector wire format. For example, Drill cannot adopt Arrow because a pre-Arrow client would not understand "dbody" message parts encoded in Arrow format and vice versa.

h4. API Version

The core of the proposal is to introduce an API version. This is a simple integer which is incremented each time that a breaking change is made to the API. (If the change can be absorbed by the Protobuf mechanism, then it is not a breaking change.) Note that the API version *is not* the same as the product version. Two different Drill versions may have the same API version if nothing changed in the API.

h4. Version Negotiation

Given a set of well-defined protocol versions, we can next define the version negotiation protocol between client and server:

* The client connects and sends a "hello" message that identifies the range of API versions that it supports, with the newest version being the version of the client itself.
* The server receives the message and computes the version of the session as the newest client version that the server supports.
* The server returns this version to the client, which switches to the selected API version. (The server returns an error, and disconnects, if there is no common version.)
* The server and client use only messages valid for the given API version. This may mean converting data from one representation to another.

The above is pretty standard.

h4. Backward Compatibility Implementation

Consider a server that must work with its own version (version c) and, say, two older versions (a and b). In most cases, changes across versions are minor. Perhaps version b introduced a better error reporting format (akin to SQLWARN and
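
The version negotiation steps listed in DRILL-5957 above can be illustrated with a small sketch. This is not Drill code: the class name ApiVersionNegotiator and its method are hypothetical, and the sketch assumes that both client and server support a contiguous range of integer API versions. The server picks the newest version that lies in both ranges and reports "no common version" otherwise.

```java
import java.util.OptionalInt;

/** Hypothetical helper for illustration only; not part of Drill. */
final class ApiVersionNegotiator {
  // Assumption: the server supports every API version in [oldestSupported, newestSupported].
  private final int oldestSupported;
  private final int newestSupported;

  ApiVersionNegotiator(int oldestSupported, int newestSupported) {
    if (oldestSupported > newestSupported) {
      throw new IllegalArgumentException("oldest version must not exceed newest version");
    }
    this.oldestSupported = oldestSupported;
    this.newestSupported = newestSupported;
  }

  /**
   * Computes the session version for a client "hello" that advertises the
   * contiguous range [clientOldest, clientNewest].
   *
   * @return the newest version supported by both sides, or empty if the
   *         ranges do not overlap (the server would then return an error
   *         and disconnect)
   */
  OptionalInt negotiate(int clientOldest, int clientNewest) {
    int newestCommon = Math.min(clientNewest, newestSupported);
    int oldestCommon = Math.max(clientOldest, oldestSupported);
    return newestCommon >= oldestCommon
        ? OptionalInt.of(newestCommon)
        : OptionalInt.empty();
  }
}
```

Under these assumptions, a server created as new ApiVersionNegotiator(3, 5) would select version 4 for a client advertising versions 1 to 4, and would find no common version for a client advertising only versions 1 to 2.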
[jira] [Created] (DRILL-5956) Add storage plugin for Druid
Jiaqi Liu created DRILL-5956: Summary: Add storage plugin for Druid Key: DRILL-5956 URL: https://issues.apache.org/jira/browse/DRILL-5956 Project: Apache Drill Issue Type: Wish Reporter: Jiaqi Liu As more and more companies are using Druid for mission-critical industrial products, Drill could gain much more popularity with Druid as one of its supported storage plugins, so that users could easily bind a Druid cluster to a running Drill instance -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (DRILL-5955) Revisit Union Vectors
Paul Rogers created DRILL-5955:
----------------------------------

Summary: Revisit Union Vectors
Key: DRILL-5955
URL: https://issues.apache.org/jira/browse/DRILL-5955
Project: Apache Drill
Issue Type: Improvement
Affects Versions: 1.11.0
Reporter: Paul Rogers

Drill supports a “Union Vector” type that allows a single column to hold values of multiple types. Conceptually, each column value is a (type, value) pair. For example, row 0 might be an Int, row 1 a Varchar and row 2 a NULL value.

The name refers to a C “union” in which the same bit of memory is used to represent one of a set of defined types.

Drill implements the union vector a bit like a map: as a collection of typed vectors. Each value is keyed by type. The result is that a union vector is more like a C “struct” than a C “union”: every vector takes space, but only one of the vectors is used for each row. For the example above, the union vector contains an Int vector, a Varchar vector and a type vector. For each row, either the Int or the Varchar is used. For NULL values, neither vector is used.

h4. Memory Footprint Concerns

The current representation, despite its name, makes very inefficient use of memory because it requires the sum of the storage for each included type. (That is, if we store 1000 rows, we need 1000 slots for integers, another 1000 for Varchar and yet another 1000 for the type vector.)

Drill poorly supports the union type. One operator that does support it is the sort. If the union type is enabled, and the sort sees a schema change, the sort will create a new union vector that combines the two types. The result is a sudden, unplanned increase in memory usage. Since the sort can buffer many hundreds of batches, this unplanned memory increase can cause the sort to run out of memory.

h4. Muddy Semantics

The union vector is closely tied with the List vector: a list vector is, essentially, an array of unions. (See DRILL-).
The list type is used to model JSON in which a list can hold anything: another list, an object or scalars. For this reason, the union vector also can hold any type. And, indeed, it can hold a union of any of these types: a Map and an Int, or a List and a Map.

Drill is a relational, SQL-based tool. Work is required to bring non-relational structures into Drill. As discussed below, a union of scalars can be made to work. But a union of structured types (lists, arrays or Maps) makes no sense.

h4. High Complexity

The union vector, as implemented, is quite complex. It contains member variables for every other vector type (except, strangely, the decimal types.) Access to typed members is by type-specific methods, meaning that the client code must include a separate call for every type, resulting in very complex client code.

The complexity allowed the union type to be made to work, but causes this one type to consume a disproportionate amount of the vector and client code.

h4. Proposed Revision to Structure: The Variant Type

Given the above, we can now present the proposed changes. First, let us recognize that a union vector need not hold structured types; there are other solutions as discussed in DRILL-. This leaves the union vector to hold just scalars.

h4. Proposed Revision to Storage

This in turn lets us adopt the [Variant type|https://en.wikipedia.org/wiki/Variant_type] originally introduced in Visual Basic. Variant “is a tagged union that can be used to represent any other data type”. The Variant type was designed to be compact by building on the idea of a tagged union in C.

{code}
struct {
  int tag; // type
  union {
    int intValue;
    long longValue;
    …
  }
}
{code}

When implemented as a vector, the format could consume just a single variable-width vector with each entry of the form: {{\[type value]}}. The vector is simply a sequence of these (type, value) pairs.

The type is a single byte that encodes the MinorType that describes the value.
That is, the type byte is like the existing type vector, but stored in the same location as the data. The data is simply the serialized format of the data. (Four bytes for an Int, 8 bytes for a Float8 and so on.)

Variable-width types require an extra field, the length field: {{\[type length value]}}. For example, a Varchar would be encoded as {{\[Varchar 27 byte0-26]}}.

A writer uses the type to drive the serialization. A reader uses the type to drive deserialization.

Note that the type field must include a special marker for nulls. Today, the union type uses 0 to indicate a null value. (Note that, in a union and variant, a null value is not a null of some type; both the type and value are null.) That form should be used in the variant representation as well. But note that the 0 value in the MinorType enum is not Null but is instead Late. This is an unpleasant messiness that the union (and variant) encoding must
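
The {{\[type value]}} and {{\[type length value]}} entry layouts described in DRILL-5955 above can be sketched as follows. This is only an illustration under stated assumptions, not Drill's vector code: the class VariantSketch is hypothetical, the tag values are placeholders rather than real MinorType codes (apart from 0 standing for null, as the ticket proposes), and the 4-byte length field for Varchar is an assumption, since the ticket does not specify its width.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of a tagged (type, value) encoding; not Drill code. */
final class VariantSketch {
  // Placeholder tags. A real encoding would use MinorType codes; only the
  // use of 0 for null follows the ticket.
  static final byte TAG_NULL = 0;
  static final byte TAG_INT = 1;
  static final byte TAG_VARCHAR = 2;

  /** Null entry: the tag alone, with no value bytes. */
  static void writeNull(ByteBuffer buf) {
    buf.put(TAG_NULL);
  }

  /** Fixed-width entry: [type value], with four bytes for an Int. */
  static void writeInt(ByteBuffer buf, int value) {
    buf.put(TAG_INT).putInt(value);
  }

  /** Variable-width entry: [type length value]. The 4-byte length is an assumption. */
  static void writeVarchar(ByteBuffer buf, String value) {
    byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
    buf.put(TAG_VARCHAR).putInt(bytes.length).put(bytes);
  }

  /** Reads one entry; the tag drives deserialization. */
  static Object read(ByteBuffer buf) {
    byte tag = buf.get();
    switch (tag) {
      case TAG_NULL:
        return null;
      case TAG_INT:
        return buf.getInt();
      case TAG_VARCHAR: {
        byte[] bytes = new byte[buf.getInt()];
        buf.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
      }
      default:
        throw new IllegalStateException("Unknown type tag: " + tag);
    }
  }
}
```

In this sketch an Int entry takes 5 bytes and a null entry takes 1 byte, in contrast to the current union layout, in which every member vector reserves a slot for every row.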
RE: gitbox?
My bad... I was trying to go deeper into the specifics of GitBox via Google and mostly client-related results came up.

Thanks!

-----Original Message-----
From: Uwe L. Korn [mailto:uw...@xhochy.com]
Sent: Sunday, November 12, 2017 3:20 AM
To: dev@drill.apache.org
Subject: Re: gitbox?

Note that this discussion is about the new Apache server-side Git services https://gitbox.apache.org/ and not about any specific client. We are very happy with it in Arrow, and I can recommend that any other Apache project switch to it as soon as possible.

Uwe
Re: gitbox?
Note that this discussion is about the new Apache server-side Git services https://gitbox.apache.org/ and not about any specific client. We are very happy with it in Arrow, and I can recommend that any other Apache project switch to it as soon as possible.

Uwe

> On 12.11.2017 at 09:08, Kunal Khatua wrote:
>
> Has anyone tried GitKraken? It's a cross-platform client that's proven to be pretty reliable for me for close to a year.
>
> My concern is that GitBox is exclusive to running on Mac.
RE: gitbox?
Has anyone tried GitKraken? It's a cross-platform client that's proven to be pretty reliable for me for close to a year.

My concern is that GitBox is exclusive to running on Mac.

-----Original Message-----
From: Parth Chandra [mailto:par...@apache.org]
Sent: Tuesday, October 31, 2017 2:52 PM
To: dev
Subject: Re: gitbox?

Gitbox allows committers to streamline the review and merge process. It provides a single button in github to merge pull requests to the Apache Drill mirror on github. This is then synchronized seamlessly with the Apache master.

The process would still require a committer to 1) review code, 2) run the functional tests if doing a batch commit.

Many other Apache projects have already moved to using gitbox.

On Tue, Oct 31, 2017 at 11:25 AM, Kunal Khatua wrote:

> For those of us that missed the hangout, can we get the minutes of the
> meeting? Would help in deciding on the vote rather than be an absentee.
>
> -----Original Message-----
> From: Parth Chandra [mailto:par...@apache.org]
> Sent: Tuesday, October 31, 2017 10:54 AM
> To: dev
> Subject: Re: gitbox?
>
> Bumping this thread up.
>
> Vlad brought this up in the hangout today and it sounds like we would
> like to move to Gitbox. Thanks Vlad for the patient explanations!
>
> Committers, let's use this thread to vote on the suggestion.
>
> I'm +1 on moving to gitbox.
>
> Also, I can work with Vlad and Paul on updating the merge process document.
>
> On Wed, Aug 30, 2017 at 1:34 PM, Vlad Rozov wrote:
>
> > Hi,
> >
> > As I am new to Drill, I don't know if migration from "Git WiP" (
> > https://git-wip-us.apache.org) to "Github Dual Master" (
> > https://gitbox.apache.org) was already discussed by the community,
> > but from my Apache Apex experience I would recommend to consider
> > migrating Drill ASF repos to the gitbox. Such move will give
> > committers write access to the Drill repository on Github with all
> > the perks that Github provides.
> >
> > Thank you,
> >
> > Vlad