Jiang, It is sooo cool to hear from actual users in the real world.
I can confirm that I have had real problems using Drill on nested data. My particular problem wasn't a lack of functions, however. It had to do with the fact that without nullable members of structures, I couldn't tell when fields were missing.

On Wed, Jan 15, 2020 at 2:31 PM Jiang Wu <[email protected]> wrote:
> An interesting set of perspectives. The market has many systems similar to Drill dealing with the relational data model. However, there is a large set of non-relational data from various APIs. An efficient and extensible query engine for this type of non-relational, schema-on-demand data is what we are looking for.
>
> Here are our perspectives on developing and using Drill:
>
> 1) Schema on-demand and non-relational model: this is the primary reason. We use Drill to interface with a schema-less columnar object store, where objects in a collection don't need to have a uniform schema.
> 2) Small footprint: we use both embedded and clustered mode.
>
> What we find lacking in Drill:
>
> 1) Support for the non-relational data model is still very limited, e.g. lacking functions that work directly on non-relational values.
> 2) Documentation. It requires a lot of expertise and experience to figure out how things work.
> 3) Not widely adopted, causing issues with finding experts to continue our work.
>
> -- Jiang
>
> On Fri, Jan 10, 2020 at 3:48 AM Igor Guzenko <[email protected]> wrote:
>
> > ---------- Forwarded message ---------
> > From: Igor Guzenko <[email protected]>
> > Date: Fri, Jan 10, 2020 at 1:46 PM
> > Subject: Re: About integration of drill and arrow
> > To: dev <[email protected]>
> >
> > Hello Drill Developers and Drill Users,
> >
> > This discussion started as migration to Arrow but uncovered questions of strategic plans for moving towards Apache Drill 2.0. Below are my personal thoughts on what we, as developers, should do to offer Drill users a better experience:
> >
> > 1.
Highly performant bulk insertions into as many data sources as possible. There is a whole bunch of different tools for data pipelining to use, but why should people who know SQL spend time learning something new simply to move data between tools?
> >
> > 2. Improve the efficiency of memory management (EVF, resource management, improved cost planning using the Metastore, etc.). Since we're dealing with big data alongside other tools installed on data nodes, we should use memory very economically and effectively.
> >
> > 3. Make integration with all other tools and formats as stable as possible. The high number of bugs in this area tells us we have a lot to improve. Every user is happy when he gets a tool and it simply works as expected. Also, analyze user requirements and provide integration with the most popular new tools. Querying a high variety of data sources was and still is one of the biggest selling points.
> >
> > 4. Make the code highly extensible and extremely friendly for contributions. No one wants to spend years learning before making a contribution. This is why I want to see a lot of modules that are highly cohesive and define clear APIs for interaction with each other. This is also about paying off old technical debts: the fat JDBC client, the copy of the web server in Drill-on-YARN, mixing everything in the exec module, etc.
> >
> > 5. Focus on performance improvements of every component, from query planning to execution.
> >
> > These are my thoughts from a developer's perspective. Since I'm just a developer from Ukraine, far far away from Drill users, I believe that Charles Givre is the one who can build a strong Drill user community and collect their requirements for us.
> >
> > Regarding Volodymyr's suggestion about adapting Arrow and Drill vectors to work together (the same step is required to implement an Arrow client, as suggested by Paul).
> > I'm totally against that idea because it brings a huge amount of unnecessary complexity just to gain small insights into the integration. First, it goes against the whole idea of Arrow, whose main purpose is to provide a unified columnar memory layout between different tools without any data conversions. But this step requires exactly such conversions: at the very least, our nullability vector and their validity bitmaps are not the same, and our Dict vector and their notion of Dict may also force data conversion.
> >
> > Another obstacle is the difference in metadata contracts; who knows whether it's even possible to combine them. Another problem, as I already mentioned, is the sheer complexity of the work. To do it, I would have to overcome all the underlying pitfalls of both projects and, in addition, cover all the conversion code with a comprehensive set of tests to show that back-and-forth conversion is done correctly for every single unit of data in both vectors. The idea of adapters and clients is about four years old or more, and no one has done the practical work to implement it. I think I have explained why.
> >
> > What I really like in Volodymyr's and Paul's suggestions is that we can extract a clear API from the existing EVF implementation and in practice provide Arrow or any other implementation for it. Who knows, maybe with the new improved garbage collectors using direct memory is not necessary at all? It is quite clear that we need a middle layer between operators and memory, and we need extensive benchmarks and experiments over that layer to show what the best underlying memory format for Drill is.
> >
> > As for client tools compatibility, the only solution I can see is to provide new clients for Drill 2.0; although I agree this is a tremendous amount of work, there is no other way to make major steps into the future.
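Stepping back to Igor's data-conversion point above: Drill stores nullability in a separate "bits" vector with one byte per value, while Arrow packs validity into a bitmap with one bit per value, least-significant bit first. A minimal standalone sketch of that conversion, with plain arrays standing in for the real vector classes of either project:

```java
// Sketch only: plain byte arrays stand in for Drill's "bits" vector
// (one byte per value, 1 = not null) and Arrow's validity bitmap
// (one bit per value, 1 = valid, LSB-first). Not actual vector code.
public class ValidityConversion {

    // Drill-style byte-per-value nullability -> Arrow-style bit-packed bitmap.
    public static byte[] toBitmap(byte[] nullabilityBytes) {
        byte[] bitmap = new byte[(nullabilityBytes.length + 7) / 8];
        for (int i = 0; i < nullabilityBytes.length; i++) {
            if (nullabilityBytes[i] != 0) {
                bitmap[i / 8] |= (1 << (i % 8)); // LSB-first bit order
            }
        }
        return bitmap;
    }

    public static boolean isValid(byte[] bitmap, int index) {
        return (bitmap[index / 8] & (1 << (index % 8))) != 0;
    }
}
```

Even this toy shows the point being made: the copy touches every value, so adapters between the two formats are never free.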
> > Without it, we should lay back and watch while Drill is slowly dying and giving up to its competitors.
> >
> > NOTE: I want to encourage everyone to join the discussion and share a vision of what should be included in Drill 2.0 and what strategic points we want to achieve in the future.
> >
> > Kind regards,
> > Igor
> >
> > On Thu, Jan 9, 2020 at 10:12 PM Paul Rogers <[email protected]> wrote:
> >
> > > Hi Volodymyr,
> > >
> > > All good points. The Arrow/Drill conversion is a good option, especially for readers and clients. Between operators, such conversion is likely to introduce performance hits. As you know, the main feature that differentiates one query engine from another is performance, so adding conversions is unlikely to help Drill in the performance battle.
> > >
> > > Flatten should actually be pretty simple with EVF. Creating repeated values is much like filling in implicit columns: set a value, then "copy down" n times.
> > >
> > > Still, you raise good issues. Operators that fit your description are things like exchanges: these operators want to work at the low level of buffers. Not much the column readers/writers can do to help. And, as you point out, commercial components will be a challenge as Apache Drill does not maintain that code.
> > >
> > > Your larger point is valid: no matter how we approach it, moving to Arrow is a large project that will break compatibility.
> > >
> > > We've discussed a simple first step: support an Arrow client to see if there is any interest. Support Arrow readers to see if that gives us any benefit. These are the visible tip of the iceberg. If we see advantages, we can then think about changing the internals; the vast bulk of the iceberg which is below water and unseen.
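Paul's "copy down" description of Flatten can be sketched in a few lines. This is an illustrative toy, not EVF or Drill operator code; plain Java lists stand in for vectors, and all names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of "set a value, then copy down n times": when an array
// column is flattened, each non-flattened column's value is repeated
// once per element of that row's array.
public class FlattenSketch {

    // Flatten arrayCol; the matching otherCol value is copied down per element.
    public static List<int[]> flatten(List<Integer> otherCol, List<int[]> arrayCol) {
        List<int[]> out = new ArrayList<>();
        for (int row = 0; row < otherCol.size(); row++) {
            for (int element : arrayCol.get(row)) {
                // one output row per array element: {repeated value, element}
                out.add(new int[] { otherCol.get(row), element });
            }
        }
        return out;
    }
}
```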
> > > I think I disagree that we'd want to swap code that works directly with ValueVectors for code that works directly with ArrowVectors. Doing so locks us into someone else's memory format and forces Drill to change every time Arrow changes. Given the small size of the Drill team and the frantic pace of Arrow change, this was the team's concern early on, and I'm still not convinced this is a good strategy.
> > >
> > > So, my original point remains: what is the benefit of all this cost? Is there another path that gives us greater benefit for the same or lesser cost? In short, what is our goal? Where do we want Drill to go?
> > >
> > > In fact, another radical suggestion is to embrace the wonderful work done on Presto. Maybe Drill 2.0 is simply Presto. We focus on adding support for local files (Drill's unique strength), and embrace Presto's great support for data types, connectors, UDFs, clients and so on.
> > >
> > > As a team, we should ask the fundamental question: what benefits can Drill offer that are not already offered by, say, Presto or commercial Drill derivatives? If we can answer that question, we'll have a better idea about whether investment in Arrow will get us there.
> > >
> > > Or, are we better off to just leave well enough alone, as we have done for several years?
> > >
> > > Thanks,
> > > - Paul
> > >
> > > On Thursday, January 9, 2020, 05:57:52 AM PST, Volodymyr Vysotskyi <[email protected]> wrote:
> > >
> > > Hi all,
> > >
> > > Glad to see that this discussion became active again!
> > >
> > > I have some comments regarding the steps for moving from Drill Vectors to Arrow Vectors.
> > >
> > > No doubt that using EVF for all operators and readers instead of value vectors will simplify things a lot.
> > > But considering the target goal - integration with Arrow - it may be the main show-stopper. There may be some operators which would be hard to adapt to use EVF; for example, I think the Flatten operator will be among them, since its implementation is deeply connected with value vectors.
> > >
> > > Also, it requires moving all storage and format plugins to EVF, which may also be problematic; for example, some plugins like MaprDB have specific features, and this should be considered when moving to EVF. Some other plugins are so obsolete that I'm not sure they still work or that anyone still uses them, so besides moving them to EVF, they should be resurrected to verify that they weren't broken further.
> > >
> > > This is a huge piece of work, and only after that would we proceed with the next step - integrating Arrow into EVF and then handling new Arrow-related issues for all the operators and readers at the same time.
> > >
> > > I propose to update these steps a little bit:
> > > 1. I agree that at first we should extract the EVF-related classes into a separate module.
> > > 2. But as the next step, I propose to extract an EVF API which doesn't depend on the vector implementation (Drill vectors or Arrow ones).
> > > 3. After that, introduce a module with Arrow which also implements this EVF API.
> > > 4. Introduce transformers that can convert from Drill vectors into Arrow vectors and vice versa. These transformers may be implemented to work using EVF abstractions instead of operating on specific vector implementations.
> > > 5.1. At this point, we can introduce Arrow connectors to fetch data in the Arrow format, or return it in that format using the transformers from step 4.
> > > 5.2. Also, at this point, we may start rewriting operators to EVF and switching the EVF implementation from the one based on Drill vectors to the one which uses Arrow vectors; or switching implementations for existing EVF-based format plugins and fixing newly discovered issues in Arrow. Since at this point we will have operators which use the Arrow format and operators which use the Drill vector format, we should insert the transforming operators introduced in step 4 between every pair of operators which return batches in different formats.
> > >
> > > I know that such an approach requires some additional work, like introducing the transformers from step 4, and may cause some performance degradation in cases where format transformation is complex for some types and where we still have sequences of operators with different formats.
> > >
> > > But with this approach, transitioning to Arrow wouldn't be blocked until everything is moved to EVF; it would be possible to transition step by step, and Drill would still be able to switch between formats if required.
> > >
> > > Kind regards,
> > > Volodymyr Vysotskyi
> > >
> > > On Thu, Jan 9, 2020 at 2:45 PM Igor Guzenko <[email protected]> wrote:
> > >
> > > > Hi Paul,
> > > >
> > > > Though I have very limited knowledge about Arrow at the moment, I can highlight a few advantages of trying it:
> > > > 1. It allows fixing all the long-standing nullability issues and provides better integration for storage plugins like Hive.
> > > > https://jira.apache.org/jira/browse/DRILL-1344
> > > > https://jira.apache.org/jira/browse/DRILL-3831
> > > > https://jira.apache.org/jira/browse/DRILL-4824
> > > > https://jira.apache.org/jira/browse/DRILL-7255
> > > > https://jira.apache.org/jira/browse/DRILL-7366
> > > > 2.
Some work was done by the community to implement optimized Arrow readers for Parquet and other formats and tools. We could try to adopt them and check whether we can benefit from them.
> > > > 3. Since Arrow is under active development, we could try their newest features, like Flight, which promises improved data transfers over the network.
> > > >
> > > > Thanks,
> > > > Igor
> > > >
> > > > On Wed, Jan 8, 2020 at 11:55 PM Paul Rogers <[email protected]> wrote:
> > > >
> > > > > Hi Igor,
> > > > >
> > > > > Before diving into design issues, it may be worthwhile to think about the premise: should Drill adopt Arrow as its internal memory layout? This is the question that the team has wrestled with since Arrow was launched. Arrow has three parts; let's think about each.
> > > > >
> > > > > First is a direct memory layout. The approach you suggest will let us work with the Arrow memory format: use EVF to access vectors, and the underlying vectors can be swapped from Drill to Arrow. But what is the advantage of using Arrow? The Arrow layout isn't better than Drill's; it is just different. Adopting the Arrow memory layout by itself provides little benefit, but big cost. This is one reason the team has been so reluctant to adopt Arrow.
> > > > >
> > > > > The only advantage of using the Arrow memory layout is if Drill could benefit from code written for Arrow. The second part of Arrow is a set of modules to manipulate vectors; Gandiva is the most prominent example. However, there are major challenges. Most SQL operations are defined to work on rows; some clever thinking will be needed to convert those operations into a series of column operations.
> > > > > (Drill's codegen is NOT columnar: it works row-by-row.) So, if we want to benefit from Gandiva, we must completely rethink how we process batches.
> > > > >
> > > > > Is it worth doing all that work? The primary benefit would be performance. But it is not clear that our current implementation is the bottleneck. The current implementation is row-based, code generated in Java. It would be great for someone to do some benchmarks to show the benefit from adopting Gandiva, to see if the potential gain justifies the likely large development cost.
> > > > >
> > > > > The third advantage of using Arrow is to allow exchange of vectors between Drill and Arrow-based clients or readers. As it turns out, this is not the big win it seems. As we've discussed, we could easily create an Arrow-based client for Drill -- there will be an RPC boundary between the client and Drill, and we can use that to do format conversion.
> > > > >
> > > > > For readers, Drill will want control over batch sizes; Drill cannot blindly accept whatever size vectors a reader chooses to produce. (More on that later.) Incoming data will be subject to projection and selection, so it will quickly move out of the incoming Arrow vectors into vectors which Drill creates.
> > > > >
> > > > > Arrow gets (or got) a lot of press. However, our job is to focus on what's best for Drill. There actually might be a memory layout for Drill that is better than Arrow (and better than our current vectors). A couple of us did a prototype some time ago that seemed to show promise. So, it is not clear that adopting Arrow is necessarily a huge win: maybe it is, maybe not.
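Paul's distinction between row-wise codegen and Gandiva-style columnar execution can be shown with a toy projection that computes two output columns. Both methods below yield identical results; only the loop shape differs, and the per-column pass over contiguous buffers is what a vectorizing engine can turn into SIMD. Illustrative only, not Drill codegen output:

```java
// Toy comparison of the two evaluation styles for projecting
// sum = a + b and prod = a * b over a batch of rows.
public class EvalStyles {

    // Row-by-row (current Drill codegen style): evaluate every output
    // column for row i before moving on to row i + 1.
    public static void projectRowWise(int[] a, int[] b, int[] sum, int[] prod) {
        for (int i = 0; i < a.length; i++) {
            sum[i] = a[i] + b[i];
            prod[i] = a[i] * b[i];
        }
    }

    // Columnar (Gandiva style): one tight pass per output column over
    // contiguous buffers. Decomposing multi-column SQL operators into
    // such passes is the "clever thinking" the discussion refers to.
    public static void projectColumnar(int[] a, int[] b, int[] sum, int[] prod) {
        for (int i = 0; i < a.length; i++) sum[i] = a[i] + b[i];
        for (int i = 0; i < a.length; i++) prod[i] = a[i] * b[i];
    }
}
```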
> > > > > We need to figure it out.
> > > > >
> > > > > What IS clearly a huge win is the idea you outlined: creating a layer between the memory layout and the rest of Drill so that we can try out different memory layouts to see what works best.
> > > > >
> > > > > Thanks,
> > > > > - Paul
> > > > >
> > > > > On Wednesday, January 8, 2020, 10:02:43 AM PST, Igor Guzenko <[email protected]> wrote:
> > > > >
> > > > > Hello Paul,
> > > > >
> > > > > I totally agree that integrating Arrow by simply replacing Vectors usage everywhere will cause a disaster. After a first look at the new *E*nhanced*V*ector*F*ramework, and based on your suggestions, I think I have an idea to share. In my opinion, the integration can be done in two major stages:
> > > > >
> > > > > *1. Preparation Stage*
> > > > >     1.1 Extract all EVF and related components to a separate module, so the new module depends only upon the Vectors module.
> > > > >     1.2 Step-by-step rewriting of all operators to use the higher-level EVF module, and removal of the Vectors module from exec and module dependencies.
> > > > >     1.3 Ensure that the only module which depends on Vectors is the new EVF one.
> > > > > *2. Integration Stage*
> > > > >     2.1 Add a dependency on the Arrow Vectors module to the EVF module.
> > > > >     2.2 Replace all usages of Drill Vectors & Protobuf Meta with Arrow Vectors & Flatbuffers Meta in the EVF module.
> > > > >     2.3 Finalize the integration by removing the Drill Vectors module completely.
> > > > >
> > > > > *NOTE:* I think that either way we won't preserve any backward compatibility for drivers and custom UDFs, and the proposed changes are a major step forward to be included in the Drill 2.0 version.
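Igor's stages, together with Volodymyr's steps 2 and 4 earlier in the thread, amount to a vector-agnostic column API plus format transformers written against it. A minimal sketch under that assumption; every name here is hypothetical, not an actual EVF class:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a column API that hides the vector implementation, plus a
// transformer written only against that API. Implementations could wrap
// Drill ValueVectors, Arrow vectors, or anything else.
public class VectorAgnosticEvf {

    public interface ColumnReader {
        int rowCount();
        boolean isNull(int row);
        int getInt(int row);
    }

    public interface ColumnWriter {
        void setInt(int value);
        void setNull();
    }

    // The transformer needs nothing but the two interfaces, so the same
    // code copies Drill -> Arrow, Arrow -> Drill, or any future format.
    public static void copyColumn(ColumnReader from, ColumnWriter to) {
        for (int i = 0; i < from.rowCount(); i++) {
            if (from.isNull(i)) to.setNull(); else to.setInt(from.getInt(i));
        }
    }

    // Toy list-backed column standing in for either vector format.
    public static class ListColumn implements ColumnReader, ColumnWriter {
        private final List<Integer> values = new ArrayList<>();
        public int rowCount() { return values.size(); }
        public boolean isNull(int row) { return values.get(row) == null; }
        public int getInt(int row) { return values.get(row); }
        public void setInt(int value) { values.add(value); }
        public void setNull() { values.add(null); }
    }
}
```

The design point is that the transformer never touches raw buffers, which is what lets the underlying memory format be swapped or benchmarked independently.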
> > > > > Below is the very first list of packages that may in the future be transformed into the EVF module:
> > > > >
> > > > > *Module:* exec/Vectors
> > > > > *Packages:*
> > > > > org.apache.drill.exec.record.metadata - (An enhanced set of classes to describe a Drill schema.)
> > > > > org.apache.drill.exec.record.metadata.schema.parser
> > > > > org.apache.drill.exec.vector.accessor - (JSON-like readers and writers for each kind of Drill vector.)
> > > > > org.apache.drill.exec.vector.accessor.convert
> > > > > org.apache.drill.exec.vector.accessor.impl
> > > > > org.apache.drill.exec.vector.accessor.reader
> > > > > org.apache.drill.exec.vector.accessor.writer
> > > > > org.apache.drill.exec.vector.accessor.writer.dummy
> > > > >
> > > > > *Module:* exec/Java Execution Engine
> > > > > *Packages:*
> > > > > org.apache.drill.exec.physical.rowSet - (Record batch management)
> > > > > org.apache.drill.exec.physical.resultSet - (Enhanced rowSet with memory mgmt)
> > > > > org.apache.drill.exec.physical.impl.scan - (Row set based scan)
> > > > >
> > > > > Thanks,
> > > > > Igor Guzenko
