Jiang, It is sooo cool to hear from actual users in the real world.
I can confirm that I have had real problems using Drill on nested data. My particular problem wasn't a lack of functions, however. It had to do with the fact that without nullable members of structures, I couldn't tell when fields were missing.

On Wed, Jan 15, 2020 at 2:31 PM Jiang Wu <[email protected]> wrote:
> An interesting set of perspectives. The market has many systems similar to Drill dealing with the relational data model. However, there is a large set of non-relational data from various APIs. An efficient and extensible query engine for this type of non-relational, schema-on-demand data is what we are looking for.
>
> Here are our perspectives on developing and using Drill:
>
> 1) Schema on-demand and non-relational model: this is the primary reason. We use Drill to interface with a schema-less columnar object store, where objects in a collection don't need to have a uniform schema.
> 2) Small footprint: we use both embedded and clustered mode.
>
> What we find lacking in Drill:
>
> 1) Support for the non-relational data model is still very limited, e.g. lacking functions that work directly on non-relational values.
> 2) Documentation. It requires a lot of expertise and experience to figure out how things work.
> 3) Not widely adopted, causing issues with finding experts to continue our work.
>
> -- Jiang
>
> On Fri, Jan 10, 2020 at 3:48 AM Igor Guzenko <[email protected]> wrote:
>
> > ---------- Forwarded message ---------
> > From: Igor Guzenko <[email protected]>
> > Date: Fri, Jan 10, 2020 at 1:46 PM
> > Subject: Re: About integration of drill and arrow
> > To: dev <[email protected]>
> >
> > Hello Drill Developers and Drill Users,
> >
> > This discussion started as migration to Arrow but uncovered questions of strategic plans for moving towards Apache Drill 2.0. Below are my personal thoughts on what we, as developers, should do to offer Drill users a better experience:
> >
> > 1.
Highly performant bulk insertions into as many data sources as possible. There is a whole bunch of different tools for data pipelining to use, but why should people who know SQL spend time learning something new simply to move data between tools?
> >
> > 2. Improve the efficiency of memory management (EVF, resource management, improved cost planning using the Metastore, etc.). Since we're dealing with big data alongside other tools installed on data nodes, we should use memory very economically and effectively.
> >
> > 3. Make integration with all other tools and formats as stable as possible. The high number of bugs in this area tells us we have a lot to improve. Every user is happy when he gets a tool and it simply works as expected. Also, analyze user requirements and provide integration with the most popular new tools. Querying a high variety of data sources was and still is one of the biggest selling points.
> >
> > 4. Make the code highly extensible and extremely friendly for contributions. No one wants to spend years learning before making a contribution. This is why I want to see a lot of modules that are highly cohesive and define clear APIs for interaction with each other. This is also about paying off old technical debts: the fat JDBC client, the copy of the web server in Drill-on-YARN, mixing everything in the exec module, etc.
> >
> > 5. Focus on performance improvements of every component, from query planning to execution.
> >
> > These are my thoughts from a developer's perspective. Since I'm just a developer from Ukraine, far far away from Drill users, I believe that Charles Givre is the one who can build a strong Drill user community and collect their requirements for us.
> >
> > Regarding Volodymyr's suggestion about adapting Arrow and Drill vectors to work together (the same step is required to implement an Arrow client, as suggested by Paul).
> > I'm totally against that idea because it brings a huge amount of unnecessary complexity just to gain small insights into the integration. First, it goes against the whole idea of Arrow, whose main purpose is to provide a unified columnar memory layout between different tools without any data conversions. But this step requires exactly such conversions: at the very least, our nullability vector and their validity bitmaps are not the same, and our Dict vector and their notion of Dict may also force data conversion.
> >
> > Another obstacle is the difference in metadata contracts; who knows whether it's even possible to combine them. Another problem, as I already mentioned, is the sheer complexity of the work. To do it, I would have to overcome all the underlying pitfalls of both projects and, in addition, cover all the conversion code with a comprehensive set of tests to show that back-and-forth conversion is done correctly for every single unit of data in both vectors. The idea of adapters and clients is about four years old or more, and no one has done the practical work to implement it. I think I have explained why.
> >
> > What I really like in Volodymyr's and Paul's suggestions is that we can extract a clear API from the existing EVF implementation and in practice provide Arrow or any other implementation for it. Who knows, maybe with the new improved garbage collectors using direct memory is not necessary at all? It is quite clear that we need a middle layer between operators and memory, and we need extensive benchmarks and experiments over that layer to show what the best underlying memory format for Drill is.
> >
> > As for client tools compatibility, the only solution I can see is to provide new clients for Drill 2.0; although I agree this is a tremendous amount of work, there is no other way to make major steps into the future.
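Stepping back to Igor's data-conversion point above: Drill stores nullability in a separate "bits" vector with one byte per value, while Arrow packs validity into a bitmap with one bit per value, least-significant bit first. A minimal standalone sketch of that conversion, with plain arrays standing in for the real vector classes of either project:

```java
// Sketch only: plain byte arrays stand in for Drill's "bits" vector
// (one byte per value, 1 = not null) and Arrow's validity bitmap
// (one bit per value, 1 = valid, LSB-first). Not actual vector code.
public class ValidityConversion {

    // Drill-style byte-per-value nullability -> Arrow-style bit-packed bitmap.
    public static byte[] toBitmap(byte[] nullabilityBytes) {
        byte[] bitmap = new byte[(nullabilityBytes.length + 7) / 8];
        for (int i = 0; i < nullabilityBytes.length; i++) {
            if (nullabilityBytes[i] != 0) {
                bitmap[i / 8] |= (1 << (i % 8)); // LSB-first bit order
            }
        }
        return bitmap;
    }

    public static boolean isValid(byte[] bitmap, int index) {
        return (bitmap[index / 8] & (1 << (index % 8))) != 0;
    }
}
```

Even this toy shows the point being made: the copy touches every value, so adapters between the two formats are never free.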
> > Without it, we should lay back and watch while Drill is slowly dying and giving up to its competitors.
> >
> > NOTE: I want to encourage everyone to join the discussion and share a vision of what should be included in Drill 2.0 and what strategic points we want to achieve in the future.
> >
> > Kind regards,
> > Igor
> >
> > On Thu, Jan 9, 2020 at 10:12 PM Paul Rogers <[email protected]> wrote:
> >
> > > Hi Volodymyr,
> > >
> > > All good points. The Arrow/Drill conversion is a good option, especially for readers and clients. Between operators, such conversion is likely to introduce performance hits. As you know, the main feature that differentiates one query engine from another is performance, so adding conversions is unlikely to help Drill in the performance battle.
> > >
> > > Flatten should actually be pretty simple with EVF. Creating repeated values is much like filling in implicit columns: set a value, then "copy down" n times.
> > >
> > > Still, you raise good issues. Operators that fit your description are things like exchanges: these operators want to work at the low level of buffers. Not much the column readers/writers can do to help. And, as you point out, commercial components will be a challenge as Apache Drill does not maintain that code.
> > >
> > > Your larger point is valid: no matter how we approach it, moving to Arrow is a large project that will break compatibility.
> > >
> > > We've discussed a simple first step: support an Arrow client to see if there is any interest. Support Arrow readers to see if that gives us any benefit. These are the visible tip of the iceberg. If we see advantages, we can then think about changing the internals; the vast bulk of the iceberg which is below water and unseen.
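Paul's "copy down" description of Flatten can be sketched in a few lines. This is an illustrative toy, not EVF or Drill operator code; plain Java lists stand in for vectors, and all names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of "set a value, then copy down n times": when an array
// column is flattened, each non-flattened column's value is repeated
// once per element of that row's array.
public class FlattenSketch {

    // Flatten arrayCol; the matching otherCol value is copied down per element.
    public static List<int[]> flatten(List<Integer> otherCol, List<int[]> arrayCol) {
        List<int[]> out = new ArrayList<>();
        for (int row = 0; row < otherCol.size(); row++) {
            for (int element : arrayCol.get(row)) {
                // one output row per array element: {repeated value, element}
                out.add(new int[] { otherCol.get(row), element });
            }
        }
        return out;
    }
}
```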
> > > I think I disagree that we'd want to swap code that works directly with ValueVectors for code that works directly with ArrowVectors. Doing so locks us into someone else's memory format and forces Drill to change every time Arrow changes. Given the small size of the Drill team and the frantic pace of Arrow change, this was the team's concern early on, and I'm still not convinced this is a good strategy.
> > >
> > > So, my original point remains: what is the benefit of all this cost? Is there another path that gives us greater benefit for the same or lesser cost? In short, what is our goal? Where do we want Drill to go?
> > >
> > > In fact, another radical suggestion is to embrace the wonderful work done on Presto. Maybe Drill 2.0 is simply Presto. We focus on adding support for local files (Drill's unique strength), and embrace Presto's great support for data types, connectors, UDFs, clients and so on.
> > >
> > > As a team, we should ask the fundamental question: what benefits can Drill offer that are not already offered by, say, Presto or commercial Drill derivatives? If we can answer that question, we'll have a better idea about whether investment in Arrow will get us there.
> > >
> > > Or, are we better off to just leave well enough alone, as we have done for several years?
> > >
> > > Thanks,
> > > - Paul
> > >
> > > On Thursday, January 9, 2020, 05:57:52 AM PST, Volodymyr Vysotskyi <[email protected]> wrote:
> > >
> > > Hi all,
> > >
> > > Glad to see that this discussion became active again!
> > >
> > > I have some comments regarding the steps for moving from Drill Vectors to Arrow Vectors.
> > >
> > > No doubt that using EVF for all operators and readers instead of value vectors will simplify things a lot.
> > > But considering the target goal - integration with Arrow - it may be the main show-stopper. There may be some operators which would be hard to adapt to use EVF; for example, I think the Flatten operator will be among them, since its implementation is deeply connected with value vectors.
> > >
> > > Also, it requires moving all storage and format plugins to EVF, which may also be problematic; for example, some plugins like MaprDB have specific features, and this should be considered when moving to EVF. Some other plugins are so obsolete that I'm not sure they still work or that anyone still uses them, so besides moving them to EVF, they should be resurrected to verify that they weren't broken further.
> > >
> > > This is a huge piece of work, and only after that would we proceed with the next step - integrating Arrow into EVF and then handling new Arrow-related issues for all the operators and readers at the same time.
> > >
> > > I propose to update these steps a little bit:
> > > 1. I agree that at first we should extract the EVF-related classes into a separate module.
> > > 2. But as the next step, I propose to extract an EVF API which doesn't depend on the vector implementation (Drill vectors or Arrow ones).
> > > 3. After that, introduce a module with Arrow which also implements this EVF API.
> > > 4. Introduce transformers that can convert from Drill vectors into Arrow vectors and vice versa. These transformers may be implemented to work using EVF abstractions instead of operating on specific vector implementations.
> > > 5.1. At this point, we can introduce Arrow connectors to fetch data in the Arrow format, or return it in that format using the transformers from step 4.
> > > 5.2. Also, at this point, we may start rewriting operators to EVF and switching the EVF implementation from the one based on Drill vectors to the one which uses Arrow vectors; or switching implementations for existing EVF-based format plugins and fixing newly discovered issues in Arrow. Since at this point we will have operators which use the Arrow format and operators which use the Drill vector format, we should insert the transforming operators introduced in step 4 between every pair of operators which return batches in different formats.
> > >
> > > I know that such an approach requires some additional work, like introducing the transformers from step 4, and may cause some performance degradation in cases where format transformation is complex for some types and where we still have sequences of operators with different formats.
> > >
> > > But with this approach, transitioning to Arrow wouldn't be blocked until everything is moved to EVF; it would be possible to transition step by step, and Drill would still be able to switch between formats if required.
> > >
> > > Kind regards,
> > > Volodymyr Vysotskyi
> > >
> > > On Thu, Jan 9, 2020 at 2:45 PM Igor Guzenko <[email protected]> wrote:
> > >
> > > > Hi Paul,
> > > >
> > > > Though I have very limited knowledge about Arrow at the moment, I can highlight a few advantages of trying it:
> > > > 1. It allows fixing all the long-standing nullability issues and provides better integration for storage plugins like Hive.
> > > > https://jira.apache.org/jira/browse/DRILL-1344
> > > > https://jira.apache.org/jira/browse/DRILL-3831
> > > > https://jira.apache.org/jira/browse/DRILL-4824
> > > > https://jira.apache.org/jira/browse/DRILL-7255
> > > > https://jira.apache.org/jira/browse/DRILL-7366
> > > > 2.
Some work was done by the community to implement optimized Arrow readers for Parquet and other formats and tools. We could try to adopt them and check whether we can benefit from them.
> > > > 3. Since Arrow is under active development, we could try their newest features, like Flight, which promises improved data transfers over the network.
> > > >
> > > > Thanks,
> > > > Igor
> > > >
> > > > On Wed, Jan 8, 2020 at 11:55 PM Paul Rogers <[email protected]> wrote:
> > > >
> > > > > Hi Igor,
> > > > >
> > > > > Before diving into design issues, it may be worthwhile to think about the premise: should Drill adopt Arrow as its internal memory layout? This is the question that the team has wrestled with since Arrow was launched. Arrow has three parts; let's think about each.
> > > > >
> > > > > First is a direct memory layout. The approach you suggest will let us work with the Arrow memory format: use EVF to access vectors, and the underlying vectors can be swapped from Drill to Arrow. But what is the advantage of using Arrow? The Arrow layout isn't better than Drill's; it is just different. Adopting the Arrow memory layout by itself provides little benefit, but big cost. This is one reason the team has been so reluctant to adopt Arrow.
> > > > >
> > > > > The only advantage of using the Arrow memory layout is if Drill could benefit from code written for Arrow. The second part of Arrow is a set of modules to manipulate vectors; Gandiva is the most prominent example. However, there are major challenges. Most SQL operations are defined to work on rows; some clever thinking will be needed to convert those operations into a series of column operations.
> > > > > (Drill's codegen is NOT columnar: it works row-by-row.) So, if we want to benefit from Gandiva, we must completely rethink how we process batches.
> > > > >
> > > > > Is it worth doing all that work? The primary benefit would be performance. But it is not clear that our current implementation is the bottleneck. The current implementation is row-based, code generated in Java. It would be great for someone to do some benchmarks to show the benefit from adopting Gandiva, to see if the potential gain justifies the likely large development cost.
> > > > >
> > > > > The third advantage of using Arrow is to allow exchange of vectors between Drill and Arrow-based clients or readers. As it turns out, this is not the big win it seems. As we've discussed, we could easily create an Arrow-based client for Drill -- there will be an RPC boundary between the client and Drill, and we can use that to do format conversion.
> > > > >
> > > > > For readers, Drill will want control over batch sizes; Drill cannot blindly accept whatever size vectors a reader chooses to produce. (More on that later.) Incoming data will be subject to projection and selection, so it will quickly move out of the incoming Arrow vectors into vectors which Drill creates.
> > > > >
> > > > > Arrow gets (or got) a lot of press. However, our job is to focus on what's best for Drill. There actually might be a memory layout for Drill that is better than Arrow (and better than our current vectors). A couple of us did a prototype some time ago that seemed to show promise. So, it is not clear that adopting Arrow is necessarily a huge win: maybe it is, maybe not.
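Paul's distinction between row-wise codegen and Gandiva-style columnar execution can be shown with a toy projection that computes two output columns. Both methods below yield identical results; only the loop shape differs, and the per-column pass over contiguous buffers is what a vectorizing engine can turn into SIMD. Illustrative only, not Drill codegen output:

```java
// Toy comparison of the two evaluation styles for projecting
// sum = a + b and prod = a * b over a batch of rows.
public class EvalStyles {

    // Row-by-row (current Drill codegen style): evaluate every output
    // column for row i before moving on to row i + 1.
    public static void projectRowWise(int[] a, int[] b, int[] sum, int[] prod) {
        for (int i = 0; i < a.length; i++) {
            sum[i] = a[i] + b[i];
            prod[i] = a[i] * b[i];
        }
    }

    // Columnar (Gandiva style): one tight pass per output column over
    // contiguous buffers. Decomposing multi-column SQL operators into
    // such passes is the "clever thinking" the discussion refers to.
    public static void projectColumnar(int[] a, int[] b, int[] sum, int[] prod) {
        for (int i = 0; i < a.length; i++) sum[i] = a[i] + b[i];
        for (int i = 0; i < a.length; i++) prod[i] = a[i] * b[i];
    }
}
```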
> > > > > We need to figure it out.
> > > > >
> > > > > What IS clearly a huge win is the idea you outlined: creating a layer between the memory layout and the rest of Drill so that we can try out different memory layouts to see what works best.
> > > > >
> > > > > Thanks,
> > > > > - Paul
> > > > >
> > > > > On Wednesday, January 8, 2020, 10:02:43 AM PST, Igor Guzenko <[email protected]> wrote:
> > > > >
> > > > > Hello Paul,
> > > > >
> > > > > I totally agree that integrating Arrow by simply replacing Vectors usage everywhere will cause a disaster. After a first look at the new *E*nhanced*V*ector*F*ramework, and based on your suggestions, I think I have an idea to share. In my opinion, the integration can be done in two major stages:
> > > > >
> > > > > *1. Preparation Stage*
> > > > >     1.1 Extract all EVF and related components to a separate module, so the new module depends only upon the Vectors module.
> > > > >     1.2 Step-by-step rewriting of all operators to use the higher-level EVF module, and removal of the Vectors module from exec and module dependencies.
> > > > >     1.3 Ensure that the only module which depends on Vectors is the new EVF one.
> > > > > *2. Integration Stage*
> > > > >     2.1 Add a dependency on the Arrow Vectors module to the EVF module.
> > > > >     2.2 Replace all usages of Drill Vectors & Protobuf Meta with Arrow Vectors & Flatbuffers Meta in the EVF module.
> > > > >     2.3 Finalize the integration by removing the Drill Vectors module completely.
> > > > >
> > > > > *NOTE:* I think that either way we won't preserve any backward compatibility for drivers and custom UDFs, and the proposed changes are a major step forward to be included in the Drill 2.0 version.
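Igor's stages, together with Volodymyr's steps 2 and 4 earlier in the thread, amount to a vector-agnostic column API plus format transformers written against it. A minimal sketch under that assumption; every name here is hypothetical, not an actual EVF class:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a column API that hides the vector implementation, plus a
// transformer written only against that API. Implementations could wrap
// Drill ValueVectors, Arrow vectors, or anything else.
public class VectorAgnosticEvf {

    public interface ColumnReader {
        int rowCount();
        boolean isNull(int row);
        int getInt(int row);
    }

    public interface ColumnWriter {
        void setInt(int value);
        void setNull();
    }

    // The transformer needs nothing but the two interfaces, so the same
    // code copies Drill -> Arrow, Arrow -> Drill, or any future format.
    public static void copyColumn(ColumnReader from, ColumnWriter to) {
        for (int i = 0; i < from.rowCount(); i++) {
            if (from.isNull(i)) to.setNull(); else to.setInt(from.getInt(i));
        }
    }

    // Toy list-backed column standing in for either vector format.
    public static class ListColumn implements ColumnReader, ColumnWriter {
        private final List<Integer> values = new ArrayList<>();
        public int rowCount() { return values.size(); }
        public boolean isNull(int row) { return values.get(row) == null; }
        public int getInt(int row) { return values.get(row); }
        public void setInt(int value) { values.add(value); }
        public void setNull() { values.add(null); }
    }
}
```

The design point is that the transformer never touches raw buffers, which is what lets the underlying memory format be swapped or benchmarked independently.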
> > > > > Below is the very first list of packages that may in the future be transformed into the EVF module:
> > > > >
> > > > > *Module:* exec/Vectors
> > > > > *Packages:*
> > > > > org.apache.drill.exec.record.metadata - (An enhanced set of classes to describe a Drill schema.)
> > > > > org.apache.drill.exec.record.metadata.schema.parser
> > > > > org.apache.drill.exec.vector.accessor - (JSON-like readers and writers for each kind of Drill vector.)
> > > > > org.apache.drill.exec.vector.accessor.convert
> > > > > org.apache.drill.exec.vector.accessor.impl
> > > > > org.apache.drill.exec.vector.accessor.reader
> > > > > org.apache.drill.exec.vector.accessor.writer
> > > > > org.apache.drill.exec.vector.accessor.writer.dummy
> > > > >
> > > > > *Module:* exec/Java Execution Engine
> > > > > *Packages:*
> > > > > org.apache.drill.exec.physical.rowSet - (Record batch management)
> > > > > org.apache.drill.exec.physical.resultSet - (Enhanced rowSet with memory mgmt)
> > > > > org.apache.drill.exec.physical.impl.scan - (Row set based scan)
> > > > >
> > > > > Thanks,
> > > > > Igor Guzenko
