Hi Igor, Thanks for your thoughts. I'm a little swamped today, but send a response over the weekend. Perhaps would you and your team in the Ukraine be interested in doing a virtual get together to discuss further? I'm based on eastern time in the US, so it's a little more convenient than California time. Thanks, -- C
> On Jan 10, 2020, at 6:47 AM, Igor Guzenko <ihor.huzenko....@gmail.com> wrote: > > ---------- Forwarded message --------- > From: Igor Guzenko <ihor.huzenko....@gmail.com> > Date: Fri, Jan 10, 2020 at 1:46 PM > Subject: Re: About integration of drill and arrow > To: dev <d...@drill.apache.org> > > > Hello Drill Developers and Drill Users, > > This discussion started as migration to Arrow but uncovered questions of > strategical plans for moving towards Apache Drill 2.0. > Below are my personal thoughts of what we, as developers, should do to > offer Drill users better experience: > > 1. High performant bulk insertions into as many data sources as possible. > There is a whole bunch of different tools for data pipelining to use... > But why people who know SQL should spend time learning something new for > simply moving data between tools? > > 2. Improve the efficiency of memory management (EVF, resource management, > improved costs planning using meta store, etc.). Since we're dealing with > big data alongside other tools installed on data nodes we should utilize > memory very economically and effectively. > > 3. Make integration with all other tools and formats as stable as possible. > The high amount of bugs in the area tells that we have lots to improve. > Every user is happy when he gets a tool and it simply works as expected. > Also, analyze user requirements and provide integration with new most > popular tools. Querying high variety of > data sources were and still one of the biggest selling points. > > 4. Make code highly extensible and extremely friendly for contributions. No > one would want to spend years of learning to make a contribution. This is > why I want to see a lot of modules that are highly cohesive and define > clear APIs for interaction with each other. This is also about paying old > technical debts related to fat JDBC client, copy of web server in Drill on > YARN, mixing everything in exec module, etc. > > 5. Focus on performance improvements of every component, from query > planning to execution. > > These are my thoughts from developer's perspective. Since I'm just > developer from Ukraine and far far away from Drill users, I believe that > Charles Givre is the one who can build a strong Drill user community and > collect their requirements for us. > > > What relates to Volodymyr's suggestion about adapting Arrow and Drill > vectors to work together (the same step is required to implement an Arrow > client, suggested by Paul). > I'm totally against the idea because it brings a huge amount of unnecessary > complexity just to uncover small insides into the integration. First is > that this is against the whole idea of Arrow since the main idea of Arrow > is to provide unified columnar memory layout between different tools > without any data conversions. But the step exactly requires data > conversions, at least our nullability vector and their validity bitmaps are > not the same, also Dict vector and their meaning of Dict may also cause > data conversion. > Another waste is the difference in metadata contracts, who knows whether > it's even possible to combine them. Another problem, like I already > mentioned is the huge complexity of the work, > To do the work I should overcome all underlying pitfalls of both projects, > in addition, I should cover all the untestable code with a comprehensive > amount of tests to show that back and forth conversion is done correctly > for every single unit of data in both vectors. The idea of adapters and > clients is about 4 years old or more and no one did practical work to > implement it. I think I explained why. > > What I really like in Volodymyr's and Paul's suggestions is that we can > extract clear API from existing EVF implementation and in practice provide > Arrow or any other implementation for it. Who knows, maybe with new > improved garbage collectors using direct memory is not necessary at all? It > is quite clear what we need the middle layer between operators and memory, > we need extensive benchmarks over the layer and experiments to show what is > the best underlying memory for Drill. > > What about client tools compatibility there is only one solution I can see > is to provide new clients for Drill 2.0, although I agree that this is a > tremendous amount of work there is no other way for making major steps into > the future. Without it, we should lay back and watch while Drill is slowly > dying and giving up to its competitors. > > NOTE: I want to encourage everyone to join the discussion and share vision > of what should be included in Drill 2.0 and what are strategic points we > want to achieve in the future. > > Kind regards, > Igor > > > On Thu, Jan 9, 2020 at 10:12 PM Paul Rogers <par0...@yahoo.com.invalid> > wrote: > >> Hi Volodymyr, >> >> All good points. The Arrow/Drill conversion is a good option, especially >> for readers and clients. Between operators, such conversion is likely to >> introduce performance hits. As you know, the main feature that >> differentiates one query engine from another is performance, so adding >> conversions is unlikely to help Drill in the performance battle. >> >> Flatten should actually be pretty simple with EVF. Creating repeated >> values is much like filling in implicit columns: set a value, then "copy >> down" n times. >> >> Still, you raise good issues. Operators that fit your description are >> things like exchanges: these operators want to work at the low level of >> buffers. Not much the column readers/writers can do to help. And, as you >> point out, commercial components will be a challenge as Apache Drill does >> not maintain that code. >> >> Your larger point is valid: no matter how we approach it, moving to Arrow >> is a large project that will break compatibility. >> >> We've discussed a simple first step: Support an Arrow client to see if >> there is any interest. Support Arrow readers to see if that gives us any >> benefit. These are the visible tip of the iceberg. If we see advantages, we >> can then think about changing the internals; the vast bulk of the iceberg >> which is below water and unseen. >> >> I think I disagree that we'd want to swap code that works directly with >> ValueVectors to code that works directly with ArrowVectors. Doing so locks >> us into someone else's memory format and forces Drill to change every time >> Arrow changes. Give the small size of the Drill team, and the frantic pace >> of Arrow change, This was the team's concern early on and I'm still not >> convinced this is a good strategy. >> >> So, my original point remains: what is the benefit of all this cost? Is >> there another path that gives us greater benefit for the same or lesser >> cost? In short, what is our goal? Where do we want Drill to go? >> >> In fact, another radical suggestion is to embrace the wonderful work done >> on Presto. Maybe Drill 2.0 is simply Presto. We focus on adding support for >> local files (Drill's unique strength), and embrace Presto's great support >> for data types, connectors, UDFs, clients and so on. >> >> As a team, we should ask the fundamental question: What benefits can Drill >> offer that are not already offered by, say Presto or commercial Drill >> derivatives? If new can answer that question, we'll have a better idea >> about whether investment in Arrow will get us there. >> >> Or, are we better off to just leave well enough alone as we have done for >> several years? >> >> Thanks, >> - Paul >> >> >> >> On Thursday, January 9, 2020, 05:57:52 AM PST, Volodymyr Vysotskyi < >> volody...@apache.org> wrote: >> >> Hi all, >> >> Glad to see that this discussion became active again! >> >> I have some comments regarding the steps for moving from Drill Vectors to >> Arrow Vectors. >> >> No doubt that using EVF for all operators and readers instead of value >> vectors will simplify things a lot. >> But considering the target goal - integration with Arrow, it may be the >> main show-stopper for it. >> There may be some operators which would be hard to adapt to use EVF, for >> example, I think Flatten operator will be among them since its >> implementation deeply connected with value vectors. >> Also, it requires moving all storage and format plugins to EVF, which also >> may be problematic, for example, some plugins like MaprDB have specific >> features, and it should be considered when moving to EVF. >> Some other plugins are so obsolete, that I'm not sure that they still work >> and that someone still uses it, so except moving to EVF, they should be >> resurrected to verify that they weren't broken more than before. >> >> This is a huge piece of work, and only after that, we will proceed with the >> next step - integrating Arrow to EVF and then handling new Arrow-related >> issues for all the operators and readers at the same time. >> >> I propose to update these steps a little bit. >> 1. I agree that at first, we should extract EVF-related classes into a >> separate module. >> 2. But as the next step, I propose to extract EVF API which doesn't depend >> on the vector implementation (Drill vectors, or Arrow ones). >> 3. After that, introduce module with Arrow which also implements this EVF >> API. >> 4. Introduce transformers that will be able to convert from Drill vectors >> into Arrow vectors and vice versa. These transformers may be implemented to >> work using EVF abstractions instead of operating with specific vector >> implementations. >> >> 5.1. At this point, we can introduce Arrow connectors to fetch the data in >> Arrow format or return it in such a format using transformers from step 4. >> >> 5.2. Also, at this point, we may start rewriting operators to EVF and >> switching EVF implementation from the EVF based on Drill Vectors to the >> implementation which uses Arrow Vectors. Or switching implementations for >> existing EVF-based format plugins and fix newly discovered issues in Arrow. >> Since at this point we will have operators which use Arrow format and >> operators which use Drill Vectors format, we should insert operators that >> transform one vector format to another introduced in step 4 between every >> pair of operators which returns batches in a different format. >> >> I know, that such an approach requires some additional work, like >> introducing transformers from step 4 and may cause some performance >> degradations for the case when format transformation is complex for some >> types and when we still have sequences of operators with different formats. >> >> But with this approach, transitioning to Arrow wouldn't be blocked until >> everything is moved to EVF and it would be possible to transmit >> step-by-step, and Drill still will be able to switch between formats if it >> would be required. >> >> Kind regards, >> Volodymyr Vysotskyi >> >> >> On Thu, Jan 9, 2020 at 2:45 PM Igor Guzenko <ihor.huzenko....@gmail.com> >> wrote: >> >>> Hi Paul, >>> >>> Though I have very limited knowledge about Arrow at the moment, I can >>> highlight a few advantages of trying it: >>> 1. Allows fixing all the long-standing nullability issues and provide >>> better integration for storage plugins like Hive. >>> https://jira.apache.org/jira/browse/DRILL-1344 >>> https://jira.apache.org/jira/browse/DRILL-3831 >>> https://jira.apache.org/jira/browse/DRILL-4824 >>> https://jira.apache.org/jira/browse/DRILL-7255 >>> https://jira.apache.org/jira/browse/DRILL-7366 >>> 2. Some work was done by community to implement optimized Arrow readers >> for >>> Parquet and other formats&tools. We could try to adopt and check whether >> we >>> can benefit from them. >>> 3. Since Arrow is under active development we could try their newest >>> features, like Flight which promises improved data transfers over the >>> network. >>> >>> Thanks, >>> Igor >>> On Wed, Jan 8, 2020 at 11:55 PM Paul Rogers <par0...@yahoo.com.invalid> >>> wrote: >>> >>>> Hi Igor, >>>> >>>> Before diving into design issues, it may be worthwhile to think about >> the >>>> premise: should Drill adopt Arrow as its internal memory layout? This >> is >>>> the question that the team has wrestled with since Arrow was launched. >>>> Arrow has three parts. Let's think about each. >>>> >>>> First is a direct memory layout. The approach you suggest will let us >>> work >>>> with the Arrow memory format. Use EVF to access vectors, then the >>>> underlying vectors can be swapped from Drill to Arrow. But, what is the >>>> advantage of using Arrow? The arrow layout isn't better than Drill's; >> it >>> is >>>> just different. Adopting the Arrow memory layout by itself provides >>> little >>>> benefit, but bit cost. This is one reason the team has been so >> reluctant >>> to >>>> atop Arrow. >>>> >>>> The only advantage of using the Arrow memory layout is if Drill could >>>> benefit from code written for Arrow. The second part of Arrow is a set >> of >>>> modules to manipulate vectors. Gandiva is the most prominent example. >>>> However, there are major challenges. Most SQL operations are defined to >>>> work on rows; some clever thinking will be needed to convert those >>>> operations into a series of column operations. (Drill's codegen is NOT >>>> columnar: it works row-by-row.) So, if we want to benefit from Gandiva, >>> we >>>> must completely rethink how we process batches. >>>> >>>> Is it worth doing all that work? The primary benefit would be >>> performance. >>>> But, it is not clear that our current implementation is the bottleneck. >>> The >>>> current implementation is row-based, code generated in Java. Would be >>> great >>>> for someone to do some benchmarks to show the benefit from adopting >>> Gandiva >>>> to see if the potential gain justifies the likely large development >> cost. >>>> >>>> The third advantage of using Arrow is to allow exchange of vectors >>> between >>>> Drill and Arrow-based clients or readers. As it turns out, this is not >>> the >>>> big win it seems. As we've discussed, we could easily create an >>> Arrow-based >>>> client for Drill -- there will be an RPC between the client and Drill >> and >>>> we can use that to do format conversion. >>>> >>>> For readers, Drill will want control over batch sizes; Drill cannot >>>> blindly accept whatever size vectors a reader chooses to produce. (More >>> on >>>> that later.) Incoming data will be subject to projection and selection, >>> so >>>> it will quickly move out of the incoming Arrow vectors into vector >> which >>>> Drill creates. >>>> >>>> Arrow gets (or got) a lot of press. However, our job is to focus on >>> what's >>>> best for Drill. There actually might be a memory layout for Drill that >> is >>>> better than Arrow (and better than our current vectors.) A couple of us >>> did >>>> a prototype some time ago that seemed to show promise. So, it is not >>> clear >>>> that adopting Arrow is necessarily a huge win: maybe it is, maybe not. >> We >>>> need to figure it out. >>>> >>>> What IS clearly a huge win is the idea you outlined: creating a layer >>>> between memory layout and the rest of Drill so that we can try out >>>> different memory layouts to see what works best. >>>> >>>> Thanks, >>>> - Paul >>>> >>>> >>>> >>>> On Wednesday, January 8, 2020, 10:02:43 AM PST, Igor Guzenko < >>>> ihor.huzenko....@gmail.com> wrote: >>>> >>>> Hello Paul, >>>> >>>> I totally agree that integrating Arrow by simply replacing Vectors >> usage >>>> everywhere will cause a disaster. >>>> After the first look at the new *E*nhanced*V*ector*F*ramework and based >>> on >>>> your suggestions I think I have an idea to share. >>>> In my opinion, the integration can be done in the two major stages: >>>> >>>> *1. Preparation Stage* >>>> 1.1 Extract all EVF and related components to a separate module. >> So >>>> the new separate module will depend only upon Vectors module. >>>> 1.2 Step-by-step rewriting of all operators to use a higher-level >>>> EVF module and remove Vectors module from exec and modules >> dependencies. >>>> 1.3 Ensure that only module which depends on Vectors is the new >> EVF >>>> one. >>>> *2. Integration Stage* >>>> 2.1 Add dependency on Arrow Vectors module into EVF module. >>>> 2.2 Replace all usages of Drill Vectors & Protobuf Meta with >>> Arrow >>>> Vectors & Flatbuffers Meta in EVF module. >>>> 2.3 Finalize integration by removing Drill Vectors module >>>> completely. >>>> >>>> >>>> *NOTE:* I think that any way we won't preserve any backward >> compatibility >>>> for drivers and custom UDFs. >>>> And proposed changes are a major step forward to be included in Drill >> 2.0 >>>> version. >>>> >>>> >>>> Below is the very first list of packages that in future may be >>> transformed >>>> into EVF module: >>>> *Module:* exec/Vectors >>>> *Packages:* >>>> org.apache.drill.exec.record.metadata - (An enhanced set of classes to >>>> describe a Drill schema.) >>>> org.apache.drill.exec.record.metadata.schema.parser >>>> >>>> org.apache.drill.exec.vector.accessor - (JSON-like readers and writers >>> for >>>> each kind of Drill vector.) >>>> org.apache.drill.exec.vector.accessor.convert >>>> org.apache.drill.exec.vector.accessor.impl >>>> org.apache.drill.exec.vector.accessor.reader >>>> org.apache.drill.exec.vector.accessor.writer >>>> org.apache.drill.exec.vector.accessor.writer.dummy >>>> >>>> *Module:* exec/Java Execution Engine >>>> *Packages:* >>>> org.apache.drill.exec.physical.rowSet - (Record batches management) >>>> org.apache.drill.exec.physical.resultSet - (Enhanced rowSet with memory >>>> mgmt) >>>> org.apache.drill.exec.physical.impl.scan - (Row set based scan) >>>> >>>> Thanks, >>>> Igor Guzenko >>>> >>>> >>> >>