An interesting set of perspectives. The market has many systems similar to Drill dealing with relational data model. However, there are a large set of non-relational data from various APIs. An efficient and extensible query engine for this type of non-relational schema-on-demand data is what we are looking for.
Here are our perspectives on developing and using Drill: 1) Schema on-demand and non-relational model: this is the primary reason. We use Drill to interface with a schema-less columnar object store, where objects in a collection don't need to have uniform schema. 2) Small foot-print: we use both embedded and clustered mode What we find lacking in Drill: 1) Support for non-relational data model is still very limited. e.g. lacking functions that work directly on non-relational values 2) Documentation. Requires a lot of expertise and experiences to figure out how things work. 3) Not widely adopted causing issues with finding experts to continue our work. -- Jiang On Fri, Jan 10, 2020 at 3:48 AM Igor Guzenko <[email protected]> wrote: > ---------- Forwarded message --------- > From: Igor Guzenko <[email protected]> > Date: Fri, Jan 10, 2020 at 1:46 PM > Subject: Re: About integration of drill and arrow > To: dev <[email protected]> > > > Hello Drill Developers and Drill Users, > > This discussion started as migration to Arrow but uncovered questions of > strategical plans for moving towards Apache Drill 2.0. > Below are my personal thoughts of what we, as developers, should do to > offer Drill users better experience: > > 1. High performant bulk insertions into as many data sources as possible. > There is a whole bunch of different tools for data pipelining to use... > But why people who know SQL should spend time learning something new for > simply moving data between tools? > > 2. Improve the efficiency of memory management (EVF, resource management, > improved costs planning using meta store, etc.). Since we're dealing with > big data alongside other tools installed on data nodes we should utilize > memory very economically and effectively. > > 3. Make integration with all other tools and formats as stable as possible. > The high amount of bugs in the area tells that we have lots to improve. > Every user is happy when he gets a tool and it simply works as expected. > Also, analyze user requirements and provide integration with new most > popular tools. Querying high variety of > data sources were and still one of the biggest selling points. > > 4. Make code highly extensible and extremely friendly for contributions. No > one would want to spend years of learning to make a contribution. This is > why I want to see a lot of modules that are highly cohesive and define > clear APIs for interaction with each other. This is also about paying old > technical debts related to fat JDBC client, copy of web server in Drill on > YARN, mixing everything in exec module, etc. > > 5. Focus on performance improvements of every component, from query > planning to execution. > > These are my thoughts from developer's perspective. Since I'm just > developer from Ukraine and far far away from Drill users, I believe that > Charles Givre is the one who can build a strong Drill user community and > collect their requirements for us. > > > What relates to Volodymyr's suggestion about adapting Arrow and Drill > vectors to work together (the same step is required to implement an Arrow > client, suggested by Paul). > I'm totally against the idea because it brings a huge amount of unnecessary > complexity just to uncover small insides into the integration. First is > that this is against the whole idea of Arrow since the main idea of Arrow > is to provide unified columnar memory layout between different tools > without any data conversions. But the step exactly requires data > conversions, at least our nullability vector and their validity bitmaps are > not the same, also Dict vector and their meaning of Dict may also cause > data conversion. > Another waste is the difference in metadata contracts, who knows whether > it's even possible to combine them. Another problem, like I already > mentioned is the huge complexity of the work, > To do the work I should overcome all underlying pitfalls of both projects, > in addition, I should cover all the untestable code with a comprehensive > amount of tests to show that back and forth conversion is done correctly > for every single unit of data in both vectors. The idea of adapters and > clients is about 4 years old or more and no one did practical work to > implement it. I think I explained why. > > What I really like in Volodymyr's and Paul's suggestions is that we can > extract clear API from existing EVF implementation and in practice provide > Arrow or any other implementation for it. Who knows, maybe with new > improved garbage collectors using direct memory is not necessary at all? It > is quite clear what we need the middle layer between operators and memory, > we need extensive benchmarks over the layer and experiments to show what is > the best underlying memory for Drill. > > What about client tools compatibility there is only one solution I can see > is to provide new clients for Drill 2.0, although I agree that this is a > tremendous amount of work there is no other way for making major steps into > the future. Without it, we should lay back and watch while Drill is slowly > dying and giving up to its competitors. > > NOTE: I want to encourage everyone to join the discussion and share vision > of what should be included in Drill 2.0 and what are strategic points we > want to achieve in the future. > > Kind regards, > Igor > > > On Thu, Jan 9, 2020 at 10:12 PM Paul Rogers <[email protected]> > wrote: > > > Hi Volodymyr, > > > > All good points. The Arrow/Drill conversion is a good option, especially > > for readers and clients. Between operators, such conversion is likely to > > introduce performance hits. As you know, the main feature that > > differentiates one query engine from another is performance, so adding > > conversions is unlikely to help Drill in the performance battle. > > > > Flatten should actually be pretty simple with EVF. Creating repeated > > values is much like filling in implicit columns: set a value, then "copy > > down" n times. > > > > Still, you raise good issues. Operators that fit your description are > > things like exchanges: these operators want to work at the low level of > > buffers. Not much the column readers/writers can do to help. And, as you > > point out, commercial components will be a challenge as Apache Drill does > > not maintain that code. > > > > Your larger point is valid: no matter how we approach it, moving to Arrow > > is a large project that will break compatibility. > > > > We've discussed a simple first step: Support an Arrow client to see if > > there is any interest. Support Arrow readers to see if that gives us any > > benefit. These are the visible tip of the iceberg. If we see advantages, > we > > can then think about changing the internals; the vast bulk of the iceberg > > which is below water and unseen. > > > > I think I disagree that we'd want to swap code that works directly with > > ValueVectors to code that works directly with ArrowVectors. Doing so > locks > > us into someone else's memory format and forces Drill to change every > time > > Arrow changes. Give the small size of the Drill team, and the frantic > pace > > of Arrow change, This was the team's concern early on and I'm still not > > convinced this is a good strategy. > > > > So, my original point remains: what is the benefit of all this cost? Is > > there another path that gives us greater benefit for the same or lesser > > cost? In short, what is our goal? Where do we want Drill to go? > > > > In fact, another radical suggestion is to embrace the wonderful work done > > on Presto. Maybe Drill 2.0 is simply Presto. We focus on adding support > for > > local files (Drill's unique strength), and embrace Presto's great support > > for data types, connectors, UDFs, clients and so on. > > > > As a team, we should ask the fundamental question: What benefits can > Drill > > offer that are not already offered by, say Presto or commercial Drill > > derivatives? If new can answer that question, we'll have a better idea > > about whether investment in Arrow will get us there. > > > > Or, are we better off to just leave well enough alone as we have done for > > several years? > > > > Thanks, > > - Paul > > > > > > > > On Thursday, January 9, 2020, 05:57:52 AM PST, Volodymyr Vysotskyi < > > [email protected]> wrote: > > > > Hi all, > > > > Glad to see that this discussion became active again! > > > > I have some comments regarding the steps for moving from Drill Vectors to > > Arrow Vectors. > > > > No doubt that using EVF for all operators and readers instead of value > > vectors will simplify things a lot. > > But considering the target goal - integration with Arrow, it may be the > > main show-stopper for it. > > There may be some operators which would be hard to adapt to use EVF, for > > example, I think Flatten operator will be among them since its > > implementation deeply connected with value vectors. > > Also, it requires moving all storage and format plugins to EVF, which > also > > may be problematic, for example, some plugins like MaprDB have specific > > features, and it should be considered when moving to EVF. > > Some other plugins are so obsolete, that I'm not sure that they still > work > > and that someone still uses it, so except moving to EVF, they should be > > resurrected to verify that they weren't broken more than before. > > > > This is a huge piece of work, and only after that, we will proceed with > the > > next step - integrating Arrow to EVF and then handling new Arrow-related > > issues for all the operators and readers at the same time. > > > > I propose to update these steps a little bit. > > 1. I agree that at first, we should extract EVF-related classes into a > > separate module. > > 2. But as the next step, I propose to extract EVF API which doesn't > depend > > on the vector implementation (Drill vectors, or Arrow ones). > > 3. After that, introduce module with Arrow which also implements this EVF > > API. > > 4. Introduce transformers that will be able to convert from Drill vectors > > into Arrow vectors and vice versa. These transformers may be implemented > to > > work using EVF abstractions instead of operating with specific vector > > implementations. > > > > 5.1. At this point, we can introduce Arrow connectors to fetch the data > in > > Arrow format or return it in such a format using transformers from step > 4. > > > > 5.2. Also, at this point, we may start rewriting operators to EVF and > > switching EVF implementation from the EVF based on Drill Vectors to the > > implementation which uses Arrow Vectors. Or switching implementations for > > existing EVF-based format plugins and fix newly discovered issues in > Arrow. > > Since at this point we will have operators which use Arrow format and > > operators which use Drill Vectors format, we should insert operators that > > transform one vector format to another introduced in step 4 between every > > pair of operators which returns batches in a different format. > > > > I know, that such an approach requires some additional work, like > > introducing transformers from step 4 and may cause some performance > > degradations for the case when format transformation is complex for some > > types and when we still have sequences of operators with different > formats. > > > > But with this approach, transitioning to Arrow wouldn't be blocked until > > everything is moved to EVF and it would be possible to transmit > > step-by-step, and Drill still will be able to switch between formats if > it > > would be required. > > > > Kind regards, > > Volodymyr Vysotskyi > > > > > > On Thu, Jan 9, 2020 at 2:45 PM Igor Guzenko <[email protected]> > > wrote: > > > > > Hi Paul, > > > > > > Though I have very limited knowledge about Arrow at the moment, I can > > > highlight a few advantages of trying it: > > > 1. Allows fixing all the long-standing nullability issues and provide > > > better integration for storage plugins like Hive. > > > https://jira.apache.org/jira/browse/DRILL-1344 > > > https://jira.apache.org/jira/browse/DRILL-3831 > > > https://jira.apache.org/jira/browse/DRILL-4824 > > > https://jira.apache.org/jira/browse/DRILL-7255 > > > https://jira.apache.org/jira/browse/DRILL-7366 > > > 2. Some work was done by community to implement optimized Arrow readers > > for > > > Parquet and other formats&tools. We could try to adopt and check > whether > > we > > > can benefit from them. > > > 3. Since Arrow is under active development we could try their newest > > > features, like Flight which promises improved data transfers over the > > > network. > > > > > > Thanks, > > > Igor > > > On Wed, Jan 8, 2020 at 11:55 PM Paul Rogers <[email protected] > > > > > wrote: > > > > > > > Hi Igor, > > > > > > > > Before diving into design issues, it may be worthwhile to think about > > the > > > > premise: should Drill adopt Arrow as its internal memory layout? This > > is > > > > the question that the team has wrestled with since Arrow was > launched. > > > > Arrow has three parts. Let's think about each. > > > > > > > > First is a direct memory layout. The approach you suggest will let us > > > work > > > > with the Arrow memory format. Use EVF to access vectors, then the > > > > underlying vectors can be swapped from Drill to Arrow. But, what is > the > > > > advantage of using Arrow? The arrow layout isn't better than Drill's; > > it > > > is > > > > just different. Adopting the Arrow memory layout by itself provides > > > little > > > > benefit, but bit cost. This is one reason the team has been so > > reluctant > > > to > > > > atop Arrow. > > > > > > > > The only advantage of using the Arrow memory layout is if Drill could > > > > benefit from code written for Arrow. The second part of Arrow is a > set > > of > > > > modules to manipulate vectors. Gandiva is the most prominent example. > > > > However, there are major challenges. Most SQL operations are defined > to > > > > work on rows; some clever thinking will be needed to convert those > > > > operations into a series of column operations. (Drill's codegen is > NOT > > > > columnar: it works row-by-row.) So, if we want to benefit from > Gandiva, > > > we > > > > must completely rethink how we process batches. > > > > > > > > Is it worth doing all that work? The primary benefit would be > > > performance. > > > > But, it is not clear that our current implementation is the > bottleneck. > > > The > > > > current implementation is row-based, code generated in Java. Would be > > > great > > > > for someone to do some benchmarks to show the benefit from adopting > > > Gandiva > > > > to see if the potential gain justifies the likely large development > > cost. > > > > > > > > The third advantage of using Arrow is to allow exchange of vectors > > > between > > > > Drill and Arrow-based clients or readers. As it turns out, this is > not > > > the > > > > big win it seems. As we've discussed, we could easily create an > > > Arrow-based > > > > client for Drill -- there will be an RPC between the client and Drill > > and > > > > we can use that to do format conversion. > > > > > > > > For readers, Drill will want control over batch sizes; Drill cannot > > > > blindly accept whatever size vectors a reader chooses to produce. > (More > > > on > > > > that later.) Incoming data will be subject to projection and > selection, > > > so > > > > it will quickly move out of the incoming Arrow vectors into vector > > which > > > > Drill creates. > > > > > > > > Arrow gets (or got) a lot of press. However, our job is to focus on > > > what's > > > > best for Drill. There actually might be a memory layout for Drill > that > > is > > > > better than Arrow (and better than our current vectors.) A couple of > us > > > did > > > > a prototype some time ago that seemed to show promise. So, it is not > > > clear > > > > that adopting Arrow is necessarily a huge win: maybe it is, maybe > not. > > We > > > > need to figure it out. > > > > > > > > What IS clearly a huge win is the idea you outlined: creating a layer > > > > between memory layout and the rest of Drill so that we can try out > > > > different memory layouts to see what works best. > > > > > > > > Thanks, > > > > - Paul > > > > > > > > > > > > > > > > On Wednesday, January 8, 2020, 10:02:43 AM PST, Igor Guzenko < > > > > [email protected]> wrote: > > > > > > > > Hello Paul, > > > > > > > > I totally agree that integrating Arrow by simply replacing Vectors > > usage > > > > everywhere will cause a disaster. > > > > After the first look at the new *E*nhanced*V*ector*F*ramework and > based > > > on > > > > your suggestions I think I have an idea to share. > > > > In my opinion, the integration can be done in the two major stages: > > > > > > > > *1. Preparation Stage* > > > > 1.1 Extract all EVF and related components to a separate module. > > So > > > > the new separate module will depend only upon Vectors module. > > > > 1.2 Step-by-step rewriting of all operators to use a > higher-level > > > > EVF module and remove Vectors module from exec and modules > > dependencies. > > > > 1.3 Ensure that only module which depends on Vectors is the new > > EVF > > > > one. > > > > *2. Integration Stage* > > > > 2.1 Add dependency on Arrow Vectors module into EVF module. > > > > 2.2 Replace all usages of Drill Vectors & Protobuf Meta with > > > Arrow > > > > Vectors & Flatbuffers Meta in EVF module. > > > > 2.3 Finalize integration by removing Drill Vectors module > > > > completely. > > > > > > > > > > > > *NOTE:* I think that any way we won't preserve any backward > > compatibility > > > > for drivers and custom UDFs. > > > > And proposed changes are a major step forward to be included in Drill > > 2.0 > > > > version. > > > > > > > > > > > > Below is the very first list of packages that in future may be > > > transformed > > > > into EVF module: > > > > *Module:* exec/Vectors > > > > *Packages:* > > > > org.apache.drill.exec.record.metadata - (An enhanced set of classes > to > > > > describe a Drill schema.) > > > > org.apache.drill.exec.record.metadata.schema.parser > > > > > > > > org.apache.drill.exec.vector.accessor - (JSON-like readers and > writers > > > for > > > > each kind of Drill vector.) > > > > org.apache.drill.exec.vector.accessor.convert > > > > org.apache.drill.exec.vector.accessor.impl > > > > org.apache.drill.exec.vector.accessor.reader > > > > org.apache.drill.exec.vector.accessor.writer > > > > org.apache.drill.exec.vector.accessor.writer.dummy > > > > > > > > *Module:* exec/Java Execution Engine > > > > *Packages:* > > > > org.apache.drill.exec.physical.rowSet - (Record batches management) > > > > org.apache.drill.exec.physical.resultSet - (Enhanced rowSet with > memory > > > > mgmt) > > > > org.apache.drill.exec.physical.impl.scan - (Row set based scan) > > > > > > > > Thanks, > > > > Igor Guzenko > > > > > > > > > > > > > >
