Re: Apache Arrow Integration

Paul Rogers Thu, 16 Nov 2017 11:35:46 -0800

Hi Saurabh,

Here is my two cents, FWIW.

Arrow integration is not about speed; Arrow’s memory layout and operations are 
very much like Drill’s (not surprising; they evolved from Drill’s value 
vectors.) Rather, the value of integration is the integration itself.

Arrow allows Drill to get out of the business of managing its own internal data 
format, delegating that task to the Arrow project. This would be parallel to 
the existing pattern in which Drill delegates its planning function to Calcite.

The key advantage of Arrow is a common in-memory format. Arrow allows other tools 
to reach inside Drill’s memory and use the common Arrow format. (But, is this 
even possible or desirable? Probably not.)

Arrow also defines an over-the-wire format so that the serialized data from 
Drill matches that of other tools. (But, Drill still must provide its own RPC 
format, which mostly negates the advantages of the common wire format.)

In my mind, it is an open question whether either of these is an actual benefit, 
or is more important than optimal performance.

At a technical level, what we probably want is an “Arrow 2.0”: taking ideas 
from Arrow and Drill, and evolving them to a higher-performance next generation 
format that does, in fact, help Drill (and other Arrow users) obtain better 
performance and better memory management.

For example, Drill, at present, does not actually exploit the ability of 
operations to be vectorized. We’d perhaps see a performance gain if we revised 
our row-wise operations to be column-wise. Arrow neither helps nor hurts this 
effort (though the Arrow API does place constraints on what we can do with our 
in-memory format.)
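
To make the row-wise versus column-wise distinction concrete, here is a minimal 
sketch in plain Java. It is not Drill or Arrow code: the class and method names 
are hypothetical, plain long[] arrays stand in for value vectors, and the 
expression (a + b) * c is an arbitrary example. The row-wise version evaluates 
the whole expression one row at a time; the column-wise version runs each 
primitive operation over an entire column in a tight loop, which is the shape 
of loop a JIT compiler may be able to vectorize.

```java
import java.util.Arrays;

// Hypothetical illustration only; none of these names come from Drill or Arrow.
final class ColumnWiseSketch {

    // Row-wise: visit one row at a time and evaluate the full expression for it.
    static long[] rowWise(long[] a, long[] b, long[] c) {
        long[] out = new long[a.length];
        for (int row = 0; row < a.length; row++) {
            out[row] = evalRow(a[row], b[row], c[row]);
        }
        return out;
    }

    // Stand-in for per-row expression code.
    static long evalRow(long a, long b, long c) {
        return (a + b) * c;
    }

    // Column-wise: apply one primitive operation at a time to whole columns.
    static long[] columnWise(long[] a, long[] b, long[] c) {
        return multiply(add(a, b), c);
    }

    // Simple loop over two columns of equal length.
    static long[] add(long[] x, long[] y) {
        long[] out = new long[x.length];
        for (int i = 0; i < x.length; i++) {
            out[i] = x[i] + y[i];
        }
        return out;
    }

    static long[] multiply(long[] x, long[] y) {
        long[] out = new long[x.length];
        for (int i = 0; i < x.length; i++) {
            out[i] = x[i] * y[i];
        }
        return out;
    }

    public static void main(String[] args) {
        long[] a = {1, 2, 3};
        long[] b = {10, 20, 30};
        long[] c = {2, 2, 2};
        // Both strategies must produce the same result.
        System.out.println(Arrays.toString(rowWise(a, b, c)));
        System.out.println(Arrays.equals(rowWise(a, b, c), columnWise(a, b, c)));
    }
}
```

Both methods compute the same values; whether the column-wise form is actually 
faster would have to be measured, since it trades tighter loops for an extra 
intermediate column.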

As Drill matures, effective memory management becomes increasingly important. 
But, Arrow’s memory layout is very much like the existing Drill vectors. Both 
have serious memory design issues: they allocate random-sized (really, 
power-of-two) blocks in random order, which is known to be a very poor choice in 
DB systems for a variety of reasons. Better is to implement vectors as chains 
of fixed-size blocks, leading to vastly simpler memory management. (Indeed, most 
classic DB products go the route of fixed-size buffers, or at least a small set 
of buffer sizes.)
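
The "chain of fixed-size blocks" idea can be sketched roughly as follows. This 
is a toy under stated assumptions, not a proposal for the actual Drill or Arrow 
classes: the name ChunkedIntVector, the 4096-byte block size, and the use of 
on-heap int[] arrays in place of direct memory buffers are all hypothetical. 
The point is only that the vector grows by appending one more identical block 
and never reallocates or copies existing data, whereas a single contiguous 
buffer that grows by doubling must do both.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: an int vector stored as a chain of fixed-size blocks.
final class ChunkedIntVector {
    static final int BLOCK_BYTES = 4096; // assumed block size
    static final int VALUES_PER_BLOCK = BLOCK_BYTES / Integer.BYTES;

    private final List<int[]> blocks = new ArrayList<>();
    private int size;

    // Append a value, adding one new fixed-size block when the last one is full.
    void add(int value) {
        int offset = size % VALUES_PER_BLOCK;
        if (offset == 0) {
            // A real system would take this from a pool of identical buffers.
            blocks.add(new int[VALUES_PER_BLOCK]);
        }
        blocks.get(blocks.size() - 1)[offset] = value;
        size++;
    }

    // Random access costs one divide and one modulo to locate the block and slot.
    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return blocks.get(index / VALUES_PER_BLOCK)[index % VALUES_PER_BLOCK];
    }

    int size() {
        return size;
    }

    int blockCount() {
        return blocks.size();
    }

    public static void main(String[] args) {
        ChunkedIntVector v = new ChunkedIntVector();
        for (int i = 0; i < 2500; i++) {
            v.add(i);
        }
        System.out.println(v.size() + " values in " + v.blockCount() + " blocks");
    }
}
```

Because every block is the same size, a freed block can satisfy any later 
request, so the allocator does not have to cope with a mix of buffer sizes. The 
cost is the extra indexing step on each access.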

All that said, Arrow has done some very nice things with vector metadata; that 
area is much improved over the current Drill version. Won’t help performance, 
but it is a cleaner implementation.

In the end, there is no substitute for actual experimentation to see which is 
faster. I wonder, has anyone done a Drill vector vs. Arrow vector performance 
comparison?

Thanks,

- Paul

> On Nov 16, 2017, at 10:52 AM, Saurabh Mahapatra 
> <[email protected]> wrote:
> 
> Hi all,
> 
> I wanted to get some thoughts on leveraging Apache Arrow for improving
> Drill speed. I believe this was discussed in the Drill hackathon in
> September.
> 
> So what was decided? Any thoughts are more than welcome.
> 
> Am I right when I say that leveraging an in-memory representation like
> Arrow is not the same as actually delivering the goods i.e. delivering on
> performance in an ad hoc environment. Intelligent caching design is a
> different problem?
> 
> Best,
> Saurabh
