Tomer's answer was excellent, but he didn't address this issue.

HDFS doesn't have enough smarts to allow pushdown of SQL predicates.  The
closest you can come is to use intelligent partitioning (your intelligence,
not that of HDFS, btw). In that case, Drill will skip reading any files
that the partitioning rules out for your query.
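For example (just a sketch, assuming your data is laid out in year/month
directories such as /data/events/2016/01/... — the path is hypothetical;
dir0 and dir1 are Drill's implicit directory columns):

    select *
    from dfs.`/data/events`
    where dir0 = '2016' and dir1 = '01';

With a filter like that, Drill prunes all the non-matching year/month
directories and only reads files under /data/events/2016/01.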



On Sat, Jan 2, 2016 at 1:33 PM, Tomer Shiran <[email protected]> wrote:

> Drill will read the data directly from HDFS in parallel. The performance
> will depend on the size of the Drill cluster, the size of the HDFS cluster,
> and the network. Drill does not translate SQL into MapReduce (the only
> system that works that way is Hive - but that approach lends itself to much
> slower performance particularly for ad-hoc analysis).
>
>
> On Sat, Jan 2, 2016 at 12:28 PM, Shashanka Kuntala <
> [email protected]> wrote:
>
> > I have a use-case where 100s of TB of data is in HDFS. Installing Drill
> > on all nodes of the HDFS cluster is not an option. If I have a separate
> > Apache Drill cluster (external to HDFS), how will Apache Drill SQL
> > perform with large data sets? Specifically, I would like to know: does
> > Drill submit MapReduce jobs on HDFS, or does Drill extract all data from
> > the HDFS cluster into the Drill cluster before applying filters/joins?
> > Will Drill push down SQL into HDFS?
> >
> >
> >
> >
>
>
> --
> Tomer Shiran
> CEO and Co-Founder, Dremio
>
