Hi Saurabh,

Here are my two cents, FWIW.
Arrow integration is not about speed: Arrow’s memory layout and operations are very much like Drill’s (not surprising; they evolved from Drill’s value vectors). Rather, the value of integration is the integration itself. Arrow allows Drill to get out of the business of managing its own internal data format, delegating that task to the Arrow project. This would parallel the existing pattern in which Drill delegates its planning function to Calcite.

The key advantage of Arrow is a common in-memory format. Arrow allows other tools to reach inside Drill’s memory and use the common Arrow format. (But is this even possible or desirable? Probably not.) Arrow also defines an over-the-wire format so that the serialized data from Drill matches that of other tools. (But Drill must still provide its own RPC format, which mostly negates the advantages of the common wire format.) In my mind, it is an open question whether either of these is an actual benefit, or is more important than optimal performance.

At a technical level, what we probably want is an “Arrow 2.0”: taking ideas from Arrow and Drill, and evolving them into a higher-performance, next-generation format that does, in fact, help Drill (and other Arrow users) obtain better performance and better memory management.

For example, Drill, at present, does not actually exploit the ability of operations to be vectorized. We’d perhaps see a performance gain if we revised our row-wise operations to be column-wise. Arrow neither helps nor hurts this effort (though the Arrow API does place constraints on what we can do with our in-memory format).

As Drill matures, effective memory management becomes increasingly important. But Arrow’s memory layout is very much like the existing Drill vectors. Both have serious memory design issues: they allocate random-sized (really, power-of-two) blocks in random order, which is known to be a very poor choice in DB systems for a variety of reasons.
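To make the allocation point a bit more concrete, here is a rough sketch (not Drill or Arrow code) that compares how much memory is reserved for a vector of a given size when the buffer is rounded up to a power of two versus when it is built from a chain of fixed-size blocks. The 32 KiB block size and the function names are assumptions chosen purely for illustration.

```python
# Illustrative only: none of these names come from Drill or Arrow.

BLOCK_SIZE = 32 * 1024  # hypothetical fixed block size, in bytes


def power_of_two_capacity(n_bytes: int) -> int:
    """Smallest power of two that is >= n_bytes."""
    if n_bytes <= 1:
        return 1
    return 1 << (n_bytes - 1).bit_length()


def fixed_block_capacity(n_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Total bytes reserved by a chain of fixed-size blocks holding n_bytes."""
    blocks = max(1, -(-n_bytes // block_size))  # ceiling division, at least one block
    return blocks * block_size


def wasted(n_bytes: int, capacity: int) -> int:
    """Bytes reserved but not used."""
    return capacity - n_bytes


# A vector just past a power-of-two boundary is the worst case for rounding up.
for n in (1024 * 1024, 1025 * 1024, 3 * 1024 * 1024):
    p2 = power_of_two_capacity(n)
    fb = fixed_block_capacity(n)
    print(f"{n:>9} bytes: power-of-two wastes {wasted(n, p2):>9}, "
          f"fixed blocks waste {wasted(n, fb):>6}")
```

With power-of-two rounding the unused space can approach the size of the data itself, while with fixed blocks it is always less than one block. This only models internal waste; it says nothing about allocation order or fragmentation across vectors.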
Better is to implement vectors as chains of fixed-size blocks, leading to vastly simpler memory management. (Indeed, most classic DB products go the route of fixed-size buffers, or at least a small set of buffer sizes.)

All that said, Arrow has done some very nice things with vector metadata; that area is much improved over the current Drill version. It won’t help performance, but it is a cleaner implementation.

In the end, there is no substitute for actual experimentation to see which is fastest. I wonder, has anyone done a Drill vector vs. Arrow vector performance comparison?

Thanks,
- Paul

> On Nov 16, 2017, at 10:52 AM, Saurabh Mahapatra
> <[email protected]> wrote:
>
> Hi all,
>
> I wanted to get some thoughts on leveraging Apache Arrow for improving
> Drill speed. I believe this was discussed in the Drill hackathon in
> September.
>
> So what was decided? Any thoughts are more than welcome.
>
> Am I right when I say that leveraging an in-memory representation like
> Arrow is not the same as actually delivering the goods i.e. delivering on
> performance in an ad hoc environment. Intelligent caching design is a
> different problem?
>
> Best,
> Saurabh
