Re: About integration of drill and arrow

Jiang Wu Wed, 15 Jan 2020 14:31:42 -0800

An interesting set of perspectives.  The market has many systems similar to
Drill that deal with the relational data model.  However, there is a large
amount of non-relational data coming from various APIs.  An efficient and
extensible query engine for this type of non-relational, schema-on-demand
data is what we are looking for.


Here are our perspectives on developing and using Drill:

1) Schema on-demand and non-relational model: this is the primary reason.
We use Drill to interface with a schema-less columnar object store, where
objects in a collection don't need to have a uniform schema.
2) Small footprint: we use both embedded and clustered modes.

What we find lacking in Drill:

1) Support for the non-relational data model is still very limited, e.g.
it lacks functions that work directly on non-relational values.
2) Documentation.  It takes a lot of expertise and experience to figure
out how things work.
3) Drill is not widely adopted, which makes it hard to find experts to
continue our work.

-- Jiang


On Fri, Jan 10, 2020 at 3:48 AM Igor Guzenko <[email protected]>
wrote:

> ---------- Forwarded message ---------
> From: Igor Guzenko <[email protected]>
> Date: Fri, Jan 10, 2020 at 1:46 PM
> Subject: Re: About integration of drill and arrow
> To: dev <[email protected]>
>
>
> Hello Drill Developers and Drill Users,
>
> This discussion started as a migration to Arrow but uncovered questions
> about strategic plans for moving towards Apache Drill 2.0.
> Below are my personal thoughts on what we, as developers, should do to
> offer Drill users a better experience:
>
> 1. High-performance bulk insertions into as many data sources as possible.
> There is a whole bunch of different tools for data pipelining to use...
> But why should people who know SQL spend time learning something new
> simply to move data between tools?
>
> 2. Improve the efficiency of memory management (EVF, resource management,
> improved costs planning using meta store, etc.). Since we're dealing with
> big data alongside other tools installed on data nodes we should utilize
> memory very economically and effectively.
>
> 3. Make integration with all other tools and formats as stable as possible.
> The high number of bugs in this area tells us that we have a lot to improve.
> Every user is happy when they get a tool and it simply works as expected.
> Also, analyze user requirements and provide integration with the newest,
> most popular tools.  Querying a wide variety of data sources was, and
> still is, one of Drill's biggest selling points.
>
> 4. Make code highly extensible and extremely friendly for contributions. No
> one wants to spend years learning in order to make a contribution. This is
> why I want to see a lot of modules that are highly cohesive and define
> clear APIs for interaction with each other. This is also about paying off old
> technical debts related to the fat JDBC client, the copy of the web server in
> Drill on YARN, mixing everything in the exec module, etc.
>
> 5. Focus on performance improvements of every component, from query
> planning to execution.
>
> These are my thoughts from a developer's perspective. Since I'm just a
> developer from Ukraine, far away from Drill users, I believe that
> Charles Givre is the one who can build a strong Drill user community and
> collect their requirements for us.
>
>
> Regarding Volodymyr's suggestion about adapting Arrow and Drill
> vectors to work together (the same step is required to implement an Arrow
> client, as suggested by Paul):
> I'm totally against the idea because it brings a huge amount of unnecessary
> complexity just to gain small insights into the integration. First,
> this goes against the whole idea of Arrow, since the main idea of Arrow
> is to provide a unified columnar memory layout between different tools
> without any data conversions. But this step requires exactly such data
> conversions: at the very least, our nullability vectors and their validity
> bitmaps are not the same, and our Dict vector and their meaning of Dict may
> also require data conversion.
> Another concern is the difference in metadata contracts; who knows whether
> it's even possible to combine them. Another problem, as I already
> mentioned, is the huge complexity of the work.
> To do the work I would have to overcome all the underlying pitfalls of both
> projects; in addition, I would have to cover all the untestable code with a
> comprehensive set of tests to show that back-and-forth conversion is done
> correctly for every single unit of data in both vectors. The idea of adapters
> and clients is about 4 years old or more, and no one has done practical work
> to implement it. I think I have explained why.
>
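To make the nullability point above concrete, here is a minimal sketch of the
kind of conversion it implies. It assumes a Drill-style layout with one
"is set" byte per value (1 = present, 0 = null) and an Arrow-style validity
bitmap with one bit per value, least-significant bit first (1 = valid). The
class and method names are hypothetical and use plain Java arrays; this is not
code from Drill or Arrow.

```java
/** Hypothetical helper for illustration only; not part of Drill or Arrow. */
final class ValidityConversion {
    private ValidityConversion() {}

    /**
     * Packs a byte-per-value "is set" array (assumed Drill-style) into a
     * bit-per-value bitmap (Arrow-style): value i maps to byte i / 8,
     * bit i % 8, counting from the least-significant bit.
     */
    static byte[] packToBitmap(byte[] isSet) {
        byte[] bitmap = new byte[(isSet.length + 7) / 8];
        for (int i = 0; i < isSet.length; i++) {
            if (isSet[i] != 0) {
                bitmap[i >> 3] |= (byte) (1 << (i & 7));
            }
        }
        return bitmap;
    }

    /** Reverse direction: expands the bitmap back to one byte per value. */
    static byte[] unpackFromBitmap(byte[] bitmap, int valueCount) {
        byte[] isSet = new byte[valueCount];
        for (int i = 0; i < valueCount; i++) {
            isSet[i] = (byte) ((bitmap[i >> 3] >> (i & 7)) & 1);
        }
        return isSet;
    }
}
```

Even in this simplified form, every batch crossing the boundary pays one pass
over its null flags in each direction, which is the cost being objected to.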
> What I really like in Volodymyr's and Paul's suggestions is that we can
> extract a clear API from the existing EVF implementation and, in practice,
> provide an Arrow or any other implementation for it. Who knows, maybe with
> new, improved garbage collectors, using direct memory is not necessary at
> all? It is quite clear that we need a middle layer between operators and
> memory, and that we need extensive benchmarks over that layer and experiments
> to show which underlying memory is best for Drill.
>
> As for client tool compatibility, the only solution I can see
> is to provide new clients for Drill 2.0. Although I agree that this is a
> tremendous amount of work, there is no other way to make major steps into
> the future. Without it, we would sit back and watch while Drill slowly
> dies and gives way to its competitors.
>
> NOTE: I want to encourage everyone to join the discussion and share their
> vision of what should be included in Drill 2.0 and what strategic points we
> want to achieve in the future.
>
> Kind regards,
> Igor
>
>
> On Thu, Jan 9, 2020 at 10:12 PM Paul Rogers <[email protected]>
> wrote:
>
> > Hi Volodymyr,
> >
> > All good points. The Arrow/Drill conversion is a good option, especially
> > for readers and clients. Between operators, such conversion is likely to
> > introduce performance hits. As you know, the main feature that
> > differentiates one query engine from another is performance, so adding
> > conversions is unlikely to help Drill in the performance battle.
> >
> > Flatten should actually be pretty simple with EVF. Creating repeated
> > values is much like filling in implicit columns: set a value, then "copy
> > down" n times.
> >
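As an illustration of the "copy down" idea for Flatten, here is a minimal
sketch using plain Java lists as stand-ins for columns. It does not use the
EVF API; the class, its fields and the choice to emit no rows for an empty
array are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of "copy down"; not Drill's Flatten operator. */
final class FlattenSketch {
    private FlattenSketch() {}

    /** Flattened output held as two parallel columns. */
    static final class Output {
        final List<String> parent = new ArrayList<>();
        final List<Integer> element = new ArrayList<>();
    }

    /**
     * For each input row, the parent value is read once and then copied down
     * next to every element of that row's array.
     */
    static Output flatten(List<String> parents, List<List<Integer>> arrays) {
        Output out = new Output();
        for (int row = 0; row < parents.size(); row++) {
            String parentValue = parents.get(row);   // set a value once per input row
            for (Integer element : arrays.get(row)) {
                out.parent.add(parentValue);         // copy down, once per element
                out.element.add(element);
            }
        }
        return out;
    }
}
```

In this sketch a row whose array is empty simply contributes no output rows.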
> > Still, you raise good issues. Operators that fit your description are
> > things like exchanges: these operators want to work at the low level of
> > buffers. Not much the column readers/writers can do to help. And, as you
> > point out, commercial components will be a challenge as Apache Drill does
> > not maintain that code.
> >
> > Your larger point is valid: no matter how we approach it, moving to Arrow
> > is a large project that will break compatibility.
> >
> > We've discussed a simple first step: Support an Arrow client to see if
> > there is any interest. Support Arrow readers to see if that gives us any
> > benefit. These are the visible tip of the iceberg. If we see advantages, we
> > can then think about changing the internals; the vast bulk of the iceberg
> > which is below water and unseen.
> >
> > I think I disagree that we'd want to swap code that works directly with
> > ValueVectors to code that works directly with ArrowVectors. Doing so locks
> > us into someone else's memory format and forces Drill to change every time
> > Arrow changes. Given the small size of the Drill team and the frantic pace
> > of Arrow change, this was the team's concern early on, and I'm still not
> > convinced this is a good strategy.
> >
> > So, my original point remains: what is the benefit of all this cost? Is
> > there another path that gives us greater benefit for the same or lesser
> > cost? In short, what is our goal? Where do we want Drill to go?
> >
> > In fact, another radical suggestion is to embrace the wonderful work done
> > on Presto. Maybe Drill 2.0 is simply Presto. We focus on adding support for
> > local files (Drill's unique strength), and embrace Presto's great support
> > for data types, connectors, UDFs, clients and so on.
> >
> > As a team, we should ask the fundamental question: What benefits can Drill
> > offer that are not already offered by, say, Presto or commercial Drill
> > derivatives? If we can answer that question, we'll have a better idea
> > about whether investment in Arrow will get us there.
> >
> > Or, are we better off to just leave well enough alone as we have done for
> > several years?
> >
> > Thanks,
> > - Paul
> >
> >
> >
> >     On Thursday, January 9, 2020, 05:57:52 AM PST, Volodymyr Vysotskyi <
> > [email protected]> wrote:
> >
> >  Hi all,
> >
> > Glad to see that this discussion became active again!
> >
> > I have some comments regarding the steps for moving from Drill Vectors to
> > Arrow Vectors.
> >
> > No doubt that using EVF for all operators and readers instead of value
> > vectors will simplify things a lot.
> > But considering the target goal - integration with Arrow, it may be the
> > main show-stopper for it.
> > There may be some operators which would be hard to adapt to use EVF; for
> > example, I think the Flatten operator will be among them since its
> > implementation is deeply connected with value vectors.
> > Also, it requires moving all storage and format plugins to EVF, which also
> > may be problematic; for example, some plugins like MaprDB have specific
> > features, and these should be considered when moving to EVF.
> > Some other plugins are so obsolete that I'm not sure that they still work
> > or that anyone still uses them, so besides moving to EVF, they would need
> > to be resurrected to verify that they weren't broken more than before.
> >
> > This is a huge piece of work, and only after that, we will proceed with the
> > next step - integrating Arrow to EVF and then handling new Arrow-related
> > issues for all the operators and readers at the same time.
> >
> > I propose to update these steps a little bit.
> > 1. I agree that at first, we should extract EVF-related classes into a
> > separate module.
> > 2. But as the next step, I propose to extract an EVF API which doesn't
> > depend on the vector implementation (Drill vectors, or Arrow ones).
> > 3. After that, introduce a module with Arrow which also implements this EVF
> > API.
> > 4. Introduce transformers that will be able to convert from Drill vectors
> > into Arrow vectors and vice versa. These transformers may be implemented to
> > work using EVF abstractions instead of operating with specific vector
> > implementations.
> >
> > 5.1. At this point, we can introduce Arrow connectors to fetch the data in
> > Arrow format or return it in such a format using transformers from step 4.
> >
> > 5.2. Also, at this point, we may start rewriting operators to EVF and
> > switching the EVF implementation from the one based on Drill Vectors to the
> > implementation which uses Arrow Vectors, or switching implementations for
> > existing EVF-based format plugins and fixing newly discovered issues in
> > Arrow. Since at this point we will have operators which use the Arrow format
> > and operators which use the Drill Vectors format, we should insert the
> > format-transforming operators introduced in step 4 between every
> > pair of operators that exchange batches in different formats.
> >
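Steps 2 to 4 above can be sketched roughly as follows: one reader/writer API
that does not mention the storage, two backing implementations, and a
transformer written only against the API. Every name here is hypothetical.
The two column classes are simplified stand-ins (one presence byte per value
versus one validity bit per value), not Drill's or Arrow's vector classes,
and the interfaces are not the real EVF accessors.

```java
import java.util.BitSet;

/** Hypothetical sketch of an implementation-neutral column API. */
final class VectorApiSketch {
    private VectorApiSketch() {}

    /** Step 2: an API that says nothing about the memory layout. */
    interface IntColumnReader {
        int rowCount();
        boolean isNull(int row);
        int getInt(int row);
    }

    interface IntColumnWriter {
        void setNull();
        void setInt(int value);
    }

    /** "Drill-like" stand-in: one presence byte per value. */
    static final class ByteFlagColumn implements IntColumnReader, IntColumnWriter {
        private final int[] values;
        private final byte[] isSet;
        private int count;

        ByteFlagColumn(int capacity) {
            values = new int[capacity];
            isSet = new byte[capacity];
        }

        public void setNull() { isSet[count++] = 0; }
        public void setInt(int value) { values[count] = value; isSet[count++] = 1; }
        public int rowCount() { return count; }
        public boolean isNull(int row) { return isSet[row] == 0; }
        public int getInt(int row) { return values[row]; }
    }

    /** Step 3: "Arrow-like" stand-in with one validity bit per value. */
    static final class BitmapColumn implements IntColumnReader, IntColumnWriter {
        private final int[] values;
        private final BitSet valid = new BitSet();
        private int count;

        BitmapColumn(int capacity) { values = new int[capacity]; }

        public void setNull() { valid.clear(count++); }
        public void setInt(int value) { values[count] = value; valid.set(count++); }
        public int rowCount() { return count; }
        public boolean isNull(int row) { return !valid.get(row); }
        public int getInt(int row) { return values[row]; }
    }

    /** Step 4: a transformer that only sees the API, so it works in both directions. */
    static void transform(IntColumnReader from, IntColumnWriter to) {
        for (int row = 0; row < from.rowCount(); row++) {
            if (from.isNull(row)) {
                to.setNull();
            } else {
                to.setInt(from.getInt(row));
            }
        }
    }
}
```

Under this sketch, the operator inserted between two operators with different
formats in step 5.2 would call transform once per column of each batch, which
is where the extra per-value cost would show up.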
> > I know that such an approach requires some additional work, like
> > introducing the transformers from step 4, and may cause some performance
> > degradation in cases where format transformation is complex for some
> > types and where we still have sequences of operators with different formats.
> >
> > But with this approach, transitioning to Arrow wouldn't be blocked until
> > everything is moved to EVF, it would be possible to transition
> > step by step, and Drill would still be able to switch between formats if
> > required.
> >
> > Kind regards,
> > Volodymyr Vysotskyi
> >
> >
> > On Thu, Jan 9, 2020 at 2:45 PM Igor Guzenko <[email protected]>
> > wrote:
> >
> > > Hi Paul,
> > >
> > > Though I have very limited knowledge about Arrow at the moment, I can
> > > highlight a few advantages of trying it:
> > >  1. It allows fixing all the long-standing nullability issues and provides
> > > better integration for storage plugins like Hive.
> > >                https://jira.apache.org/jira/browse/DRILL-1344
> > >                https://jira.apache.org/jira/browse/DRILL-3831
> > >                https://jira.apache.org/jira/browse/DRILL-4824
> > >                https://jira.apache.org/jira/browse/DRILL-7255
> > >                https://jira.apache.org/jira/browse/DRILL-7366
> > > 2. Some work was done by the community to implement optimized Arrow
> > > readers for Parquet and other formats & tools. We could try to adopt them
> > > and check whether we can benefit from them.
> > > 3. Since Arrow is under active development, we could try its newest
> > > features, like Flight, which promises improved data transfers over the
> > > network.
> > >
> > > Thanks,
> > > Igor
> > > On Wed, Jan 8, 2020 at 11:55 PM Paul Rogers <[email protected]>
> > > wrote:
> > >
> > > > Hi Igor,
> > > >
> > > > Before diving into design issues, it may be worthwhile to think about the
> > > > premise: should Drill adopt Arrow as its internal memory layout? This is
> > > > the question that the team has wrestled with since Arrow was launched.
> > > > Arrow has three parts. Let's think about each.
> > > >
> > > > First is a direct memory layout. The approach you suggest will let us work
> > > > with the Arrow memory format. Use EVF to access vectors, then the
> > > > underlying vectors can be swapped from Drill to Arrow. But, what is the
> > > > advantage of using Arrow? The Arrow layout isn't better than Drill's; it is
> > > > just different. Adopting the Arrow memory layout by itself provides little
> > > > benefit, but big cost. This is one reason the team has been so reluctant to
> > > > adopt Arrow.
> > > >
> > > > The only advantage of using the Arrow memory layout is if Drill could
> > > > benefit from code written for Arrow. The second part of Arrow is a set of
> > > > modules to manipulate vectors. Gandiva is the most prominent example.
> > > > However, there are major challenges. Most SQL operations are defined to
> > > > work on rows; some clever thinking will be needed to convert those
> > > > operations into a series of column operations. (Drill's codegen is NOT
> > > > columnar: it works row-by-row.) So, if we want to benefit from Gandiva, we
> > > > must completely rethink how we process batches.
> > > >
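The row-versus-column distinction above can be illustrated with a small
sketch that evaluates the filter expression a + b > c both ways. The class
and method names are hypothetical; this is neither Drill's generated code nor
Gandiva's API, only plain Java arrays standing in for vectors.

```java
/** Hypothetical illustration of row-wise versus column-wise evaluation. */
final class RowVsColumnSketch {
    private RowVsColumnSketch() {}

    /** Row at a time: the whole expression is evaluated for one row, then the next. */
    static boolean[] evalRowWise(int[] a, int[] b, int[] c) {
        boolean[] out = new boolean[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] + b[i] > c[i];
        }
        return out;
    }

    /** Column at a time: each primitive runs over a full column and materializes a result. */
    static boolean[] evalColumnWise(int[] a, int[] b, int[] c) {
        int[] sum = add(a, b);          // intermediate column for a + b
        return greaterThan(sum, c);     // second pass compares against c
    }

    static int[] add(int[] left, int[] right) {
        int[] out = new int[left.length];
        for (int i = 0; i < left.length; i++) {
            out[i] = left[i] + right[i];
        }
        return out;
    }

    static boolean[] greaterThan(int[] left, int[] right) {
        boolean[] out = new boolean[left.length];
        for (int i = 0; i < left.length; i++) {
            out[i] = left[i] > right[i];
        }
        return out;
    }
}
```

Both forms give the same answer; the difference is that the columnar form
splits one expression into a chain of whole-column kernels with intermediate
results, which is the restructuring of batch processing referred to above.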
> > > > Is it worth doing all that work? The primary benefit would be performance.
> > > > But, it is not clear that our current implementation is the bottleneck. The
> > > > current implementation is row-based, code generated in Java. It would be
> > > > great for someone to do some benchmarks to show the benefit from adopting
> > > > Gandiva, to see if the potential gain justifies the likely large
> > > > development cost.
> > > >
> > > > The third advantage of using Arrow is to allow exchange of vectors between
> > > > Drill and Arrow-based clients or readers. As it turns out, this is not the
> > > > big win it seems. As we've discussed, we could easily create an Arrow-based
> > > > client for Drill -- there will be an RPC between the client and Drill and
> > > > we can use that to do format conversion.
> > > >
> > > > For readers, Drill will want control over batch sizes; Drill cannot
> > > > blindly accept whatever size vectors a reader chooses to produce. (More on
> > > > that later.) Incoming data will be subject to projection and selection, so
> > > > it will quickly move out of the incoming Arrow vectors into vectors which
> > > > Drill creates.
> > > >
> > > > Arrow gets (or got) a lot of press. However, our job is to focus on what's
> > > > best for Drill. There actually might be a memory layout for Drill that is
> > > > better than Arrow (and better than our current vectors.) A couple of us did
> > > > a prototype some time ago that seemed to show promise. So, it is not clear
> > > > that adopting Arrow is necessarily a huge win: maybe it is, maybe not. We
> > > > need to figure it out.
> > > >
> > > > What IS clearly a huge win is the idea you outlined: creating a layer
> > > > between memory layout and the rest of Drill so that we can try out
> > > > different memory layouts to see what works best.
> > > >
> > > > Thanks,
> > > > - Paul
> > > >
> > > >
> > > >
> > > >    On Wednesday, January 8, 2020, 10:02:43 AM PST, Igor Guzenko <
> > > > [email protected]> wrote:
> > > >
> > > >  Hello Paul,
> > > >
> > > > I totally agree that integrating Arrow by simply replacing Vectors usage
> > > > everywhere will cause a disaster.
> > > > After the first look at the new Enhanced Vector Framework (EVF) and based
> > > > on your suggestions I think I have an idea to share.
> > > > In my opinion, the integration can be done in two major stages:
> > > >
> > > > *1. Preparation Stage*
> > > >      1.1 Extract all EVF and related components to a separate module, so
> > > >          that the new module depends only upon the Vectors module.
> > > >      1.2 Rewrite all operators, step by step, to use the higher-level
> > > >          EVF module, and remove the Vectors module from the dependencies
> > > >          of exec and other modules.
> > > >      1.3 Ensure that the only module which depends on Vectors is the new
> > > >          EVF one.
> > > > *2. Integration Stage*
> > > >      2.1 Add a dependency on the Arrow Vectors module to the EVF module.
> > > >      2.2 Replace all usages of Drill Vectors & Protobuf Meta with Arrow
> > > >          Vectors & Flatbuffers Meta in the EVF module.
> > > >      2.3 Finalize the integration by removing the Drill Vectors module
> > > >          completely.
> > > >
> > > >
> > > > *NOTE:* I think that either way we won't preserve any backward
> > > > compatibility for drivers and custom UDFs.
> > > > The proposed changes are a major step forward, to be included in the
> > > > Drill 2.0 version.
> > > >
> > > >
> > > > Below is the very first list of packages that in future may be
> > > > transformed into the EVF module:
> > > > *Module:* exec/Vectors
> > > > *Packages:*
> > > > org.apache.drill.exec.record.metadata - (An enhanced set of classes to
> > > > describe a Drill schema.)
> > > > org.apache.drill.exec.record.metadata.schema.parser
> > > >
> > > > org.apache.drill.exec.vector.accessor - (JSON-like readers and writers for
> > > > each kind of Drill vector.)
> > > > org.apache.drill.exec.vector.accessor.convert
> > > > org.apache.drill.exec.vector.accessor.impl
> > > > org.apache.drill.exec.vector.accessor.reader
> > > > org.apache.drill.exec.vector.accessor.writer
> > > > org.apache.drill.exec.vector.accessor.writer.dummy
> > > >
> > > > *Module:* exec/Java Execution Engine
> > > > *Packages:*
> > > > org.apache.drill.exec.physical.rowSet - (Record batches management)
> > > > org.apache.drill.exec.physical.resultSet - (Enhanced rowSet with memory
> > > > mgmt)
> > > > org.apache.drill.exec.physical.impl.scan - (Row set based scan)
> > > >
> > > > Thanks,
> > > > Igor Guzenko
> > > >
> > > >
> > >
> >
>
