Yes, that's the expected behavior for now. Directory pruning where only a subdirectory is specified is logically equivalent to wildcard matching (for example '*/*/10'), which is not supported yet. You could open an enhancement request.
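
To make the equivalence concrete, here is a small Python sketch. It is only a toy model of the idea, not Drill's planner code, and the names in it (DIRS, pattern_from_filters, is_prefix_pattern, select_dirs) are made up for illustration. A filter on dir0, dir1 and dir2 maps to a plain top-down path, while a filter on dir2 alone maps to a pattern with wildcards in the leading levels.

```python
# Illustration only: a toy model of directory pruning, not Drill's planner code.

# Hypothetical layout modeled on the thread: /<year>/<month>/<day>
DIRS = [
    ("2015", "01", "10"), ("2015", "01", "11"), ("2015", "01", "12"),
    ("2015", "02", "10"), ("2015", "02", "11"),
]

def pattern_from_filters(filters, depth=3):
    """Map equality filters on dirN columns to one pattern entry per level.

    Levels without a filter become '*', meaning "any directory at this level".
    """
    return tuple(filters.get(f"dir{i}", "*") for i in range(depth))

def is_prefix_pattern(pattern):
    """True if every concrete level comes before any '*' (a plain top-down path)."""
    seen_wildcard = False
    for level in pattern:
        if level == "*":
            seen_wildcard = True
        elif seen_wildcard:
            return False
    return True

def select_dirs(dirs, pattern):
    """Keep the directories whose levels match the pattern level by level."""
    return [d for d in dirs if all(p in ("*", part) for part, p in zip(d, pattern))]

full = pattern_from_filters({"dir0": "2015", "dir1": "01", "dir2": "10"})
partial = pattern_from_filters({"dir2": "10"})

print("/".join(full), is_prefix_pattern(full))
print("/".join(partial), is_prefix_pattern(partial))
print(select_dirs(DIRS, partial))
```

In this model the dir2-only filter becomes '*/*/10', which cannot be resolved by walking down a single path. Every directory at the upper levels has to be checked, which is the wildcard case described above.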

On Tue, Feb 3, 2015 at 5:27 PM, Andries Engelbrecht <[email protected]> wrote:

> Is it required for the directory pruning to work that a top down filter of
> directories be applied?
>
> My current observation is that for a directory structure as listed below,
> the pruning only works if the full tree is provided. If only a lower level
> directory is supplied in the filter condition Drill only uses it as a
> filter.
>
> /2015
>     /01
>         /10
>         /11
>         /12
>         /13
>         /14
>
> select count(id) from `/foo` t where dir0='2015' and dir1='01' and
> dir2='10'
> Produces the correct pruning and query plan
> 01-02 Project(id=[$3]): rowcount = 3670316.0, cumulative cost =
> {1.1010948E7 rows, 1.4681284E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id =
> 28434
> 01-03 Project(dir0=[$0], dir1=[$3], dir2=[$2], id=[$1]):
> rowcount = 3670316.0, cumulative cost = {7340632.0 rows, 1.468128E7 cpu,
> 0.0 io, 0.0 network, 0.0 memory}, id = 28433
> 01-04 Scan(groupscan=[EasyGroupScan [selectionRoot=/foo,
> numFiles=24, columns=[`dir0`, `dir1`, `dir2`, `id`]
>
> However
> select count(id) from `/foo` t where dir2='10'
> Produces full scan of all sub directories and only applies a filter
> condition after the fact. Notice the numFiles between the 2, even though it
> lists columns in the base scan
> 01-04 Filter(condition=[=($0, '10')]): rowcount =
> 9423761.7, cumulative cost = {1.88475234E8 rows, 3.76950476E8 cpu, 0.0 io,
> 0.0 network, 0.0 memory}, id = 27470
> 01-05 Project(dir2=[$1], id=[$0]): rowcount =
> 6.2825078E7, cumulative cost = {1.25650156E8 rows, 1.25650164E8 cpu, 0.0
> io, 0.0 network, 0.0 memory}, id = 27469
> 01-06 Scan(groupscan=[EasyGroupScan
> [selectionRoot=/foo, numFiles=405, columns=[`dir2`, `id`]
>
> Any thoughts?
>
> Thanks
>
> —Andries
