Even for csv or json formats, directory-based partition pruning [1] can be leveraged to prune data. You have to use the special dir* fields (dir0, dir1, ...) in your query to filter out unwanted data, or define a view that uses the dir* fields and then query against the view.
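As a minimal sketch (the /data/logs path and year/month layout here are hypothetical, not from the thread), both approaches look like this:

```sql
-- Hypothetical layout: /data/logs/<year>/<month>/*.csv
-- dir0 maps to the first subdirectory level (year), dir1 to the second (month),
-- so this predicate lets Drill skip directories entirely instead of scanning them.
SELECT *
FROM dfs.`/data/logs`
WHERE dir0 = '2015' AND dir1 = '07';

-- Alternatively, hide the dir* fields behind a view and filter on the view:
CREATE VIEW dfs.tmp.logs_by_month AS
SELECT dir0 AS `year`, dir1 AS `month`, columns
FROM dfs.`/data/logs`;

SELECT * FROM dfs.tmp.logs_by_month
WHERE `year` = '2015' AND `month` = '07';
```

The view approach keeps the directory mechanics out of user queries while still letting the planner prune partitions from the WHERE clause.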
1. https://drill.apache.org/docs/partition-pruning/

On Thu, Jul 23, 2015 at 8:09 AM, Abdel Hakim Deneche <[email protected]> wrote:

> Hi Hafiz,
>
> I guess it depends on the query. Generally, Drill will try to push any
> filter in your query down to the leaf nodes, so they won't send any row
> that doesn't pass the filter. Also, only the columns that appear in the
> query will be loaded from the file.
>
> The file format you are querying also affects how much data is read from
> disk: with Parquet, Drill can avoid reading unnecessary columns, but for
> other formats (csv or json) Drill will still need to read everything from
> disk, then discard the unneeded columns before sending the remaining data
> for further processing.
>
> Adding a limit to the query can also help; Drill will stop reading the
> data as soon as enough records have been collected.
>
> On Thu, Jul 23, 2015 at 8:01 AM, Hafiz Mujadid <[email protected]>
> wrote:
>
> > Hi all!
> >
> > I want to know how Drill works. Suppose I query data on S3, and the
> > volume of data is huge, in the GBs. What happens when I query that
> > data? Does Drill load the whole dataset onto the Drill nodes, or does
> > it query the data without loading all of it?
>
> --
> Abdelhakim Deneche
> Software Engineer
> <http://www.mapr.com/>
