Hey Jason,

Answers inline below.
On Thu, Jun 29, 2017 at 2:52 AM, Jason Heo <jason.heo....@gmail.com> wrote:
> Hi,
>
> Q1.
>
> After reading Druid vs Kudu
> <http://druid.io/docs/latest/comparisons/druid-vs-kudu.html>, I found that
> Druid has aggregation push down:
>
>> Druid includes its own query layer that allows it to push down
>> aggregations and computations directly to data nodes for faster query
>> processing.
>
> If I understand "aggregation push down" correctly, it seems that partial
> aggregation is done on the data node side, so that only a small amount of
> the result set has to be transferred to the client, which could lead to a
> great performance gain. (Am I right?)

That's right.

> So I wanted to know if Apache Kudu has a plan for an aggregation push down
> scan feature (or already has it).

It currently doesn't have it, and there aren't any current plans to add it. Usually, we assume that Kudu tablet servers are collocated with either Impala daemons or Spark executors. So, it's less important to provide pushdown into the tablet server itself: even without it, we typically avoid any network transfer from the TS into the execution environment that does the aggregation.

The above would be less true if there were some way in which Kudu itself could perform the aggregation more efficiently based on knowledge of its underlying storage. For example, one could imagine doing a GROUP BY on 'foo_col' more efficiently within Kudu if the column is dictionary-encoded, by aggregating on the code words rather than the decoded strings, since the integer code words are fixed length and faster to hash, compare, etc. (there's a rough sketch of this idea below). That said, it hasn't been a high priority relative to other performance areas we're exploring.

> Q2.
>
> One thing that concerns me when using Impala+Kudu is that all matching
> rows must be transferred from the Kudu tserver to the Impala process.
> Usually Impala and the Kudu tserver run on the same node, so it would be
> great if Impala could read Kudu tablets directly. Any plan for this kind
> of feature?
>
> How-to: Use Impala and Kudu Together for Analytic Workloads
> <https://blog.cloudera.com/blog/2016/04/how-to-use-impala-and-kudu-together-for-analytic-workloads/>
> says that:
>
>> we intend to implement the Apache Arrow in-memory data format and to
>> share memory between Kudu and Impala, which we expect will help with
>> performance and resource usage.
>
> What does "share memory between Kudu and Impala" mean? Is this already
> implemented?

Yes, currently all matching rows are transferred from the Kudu TS to the Impala daemon. Impala schedules scanners for locality, though, so this is a localhost-scoped connection, which is quite fast. To give you a sense of the speed, I just tested a localhost TCP connection using 'iperf' and measured ~6GB/sec on a single core (commands below). Although this is significantly slower than a within-process memcpy, it's still fast enough that it usually represents a small fraction of the overall CPU consumption of a query.

Regarding sharing memory, the first step, which is already implemented, is to share a common in-memory layout. That is to say, the format in which Kudu returns rows over the wire to the client matches the format in which Impala expects rows in memory. So, when Impala receives a block of rows in the scanner, it doesn't have to "parse" or "convert" them into some other format; it can simply interpret the data in place (see the first sketch below). This saves a lot of CPU.
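To make "interpret the data in place" a bit more concrete, here's a minimal C++ sketch. The Row struct and the two-column layout are invented for illustration (the real format also has to deal with schemas, null bitmaps, variable-length values, and so on); the point is only that when the wire layout matches the engine's row layout, there's no per-row decode pass:

#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

// Invented fixed-width row layout, shared by the "wire" format and the
// execution engine: one int64 key and one double metric per row.
struct Row {
  int64_t key;
  double value;
};

int main() {
  // Pretend these bytes just arrived over the scanner's socket.
  std::vector<Row> sent = {{1, 0.5}, {2, 1.5}, {3, 2.5}};
  std::vector<char> wire(sent.size() * sizeof(Row));
  std::memcpy(wire.data(), sent.data(), wire.size());

  // Because the layouts agree, the receiver can view the buffer as an
  // array of rows directly -- no per-field parse/convert pass. (A real
  // engine would also validate lengths, alignment, lifetimes, etc.)
  const Row* rows = reinterpret_cast<const Row*>(wire.data());
  size_t n = wire.size() / sizeof(Row);

  double sum = 0;
  for (size_t i = 0; i < n; i++) {
    sum += rows[i].value;  // plain loads, no decode step
  }
  std::cout << "sum=" << sum << std::endl;
  return 0;
}

The "parse"/"convert" alternative would be a per-row, per-field copy into a different layout, which costs CPU proportional to the data volume.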
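And circling back to the Q1 discussion, here's an equally rough sketch of the dictionary code-word idea: a COUNT(*) GROUP BY on an encoded string column where the hot loop only ever touches fixed-length integers. The column, dictionary, and data are all invented, and to be clear, Kudu doesn't implement this today:

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

int main() {
  // Dictionary for a string column: code word -> decoded value.
  std::vector<std::string> dict = {"us", "uk", "de"};
  // The column itself, stored as fixed-length integer code words.
  std::vector<uint32_t> codes = {0, 1, 0, 2, 2, 0};

  // COUNT(*) ... GROUP BY foo_col, but the per-row work is all on
  // integers: no string hashing or comparisons in the loop.
  std::vector<int64_t> counts(dict.size(), 0);
  for (uint32_t c : codes) {
    counts[c]++;
  }

  // Decode each distinct value exactly once, at output time.
  for (size_t c = 0; c < dict.size(); c++) {
    if (counts[c] > 0) {
      std::cout << dict[c] << ": " << counts[c] << std::endl;
    }
  }
  return 0;
}

The strings are only materialized once per distinct value at output time, which is where the win over hashing and comparing the decoded values would come from.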
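(For reference, the ~6GB/sec figure above came from a stock iperf pair, something like 'iperf -s' in one terminal and 'iperf -c 127.0.0.1' in another; iperf reports bits per second by default, so it showed up as roughly 48Gbit/sec.)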
As for the in-memory format itself: using something like Arrow would be even more efficient than the current format, since it is columnar rather than row-oriented. However, Impala currently does not use a columnar format for its operator pipeline, so we can't currently make use of Arrow to optimize the interchange.

Currently, as mentioned above, the data (in the common format) is transferred from Kudu to Impala via a localhost TCP socket. A few years ago we had an intern who experimented with using a Unix domain socket instead and found a small speedup. He also experimented with setting up a shared memory region and found another small speedup over the domain socket. However, there was a lot of complexity involved in that code (particularly the shared memory approach) relative to the gain we saw, so we didn't end up merging it before his internship ended :)

-Todd

--
Todd Lipcon
Software Engineer, Cloudera