The casting issue seems like a real bug. People want to do things like "dir0 > 2012".
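
For illustration, here is a rough sketch of the filter forms in question, reusing the `/foo` path and `id` column from Andries's example in the quoted thread. The behavior noted for the first two queries is only what was reported in that thread. The last query is an untested assumption on my part: I have not checked whether a string range comparison is pruned.

```sql
-- Reported to prune: dir0 is compared to a string literal.
EXPLAIN PLAN FOR
SELECT count(id) FROM `/foo` t WHERE dir0 = '2015';

-- Reported to return results without pruning: numeric literal.
EXPLAIN PLAN FOR
SELECT count(id) FROM `/foo` t WHERE dir0 = 2015;

-- The kind of range filter people want to write on a directory column.
EXPLAIN PLAN FOR
SELECT count(id) FROM `/foo` t WHERE dir0 > 2012;

-- Untested assumption: the same range written against a string literal.
-- Four-digit year strings sort lexicographically in numeric order, so the
-- result set should match; whether the directories get pruned is unverified.
EXPLAIN PLAN FOR
SELECT count(id) FROM `/foo` t WHERE dir0 > '2012';
```

Comparing numFiles in the Scan line of each plan should show whether the directories were pruned or the condition was applied as a filter after a full scan.
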
On Tue, Feb 3, 2015 at 6:00 PM, Andries Engelbrecht <[email protected]> wrote:

> Thanks.
>
> It will be good for users to understand the specifics of directory pruning.
>
> As an additional note, it is important not to cast the data type of the dir
> filter and to provide a string (e.g. dir0='2015') for pruning to work
> properly.
> With dir0=2015 the query works, but the directories are not pruned.
>
> Similarly, if a view is created with columns for dir0, dir1, etc., the data
> types should not be cast or converted, based on current observations.
>
> It may be good to make it a bit friendlier for a better user experience;
> I will file an enhancement request.
>
> --Andries
>
>
> On Feb 3, 2015, at 5:35 PM, Aman Sinha <[email protected]> wrote:
>
> > Yes, that's the expected behavior for now. Directory pruning where only
> > a subdirectory is specified is logically equivalent to wildcard matching
> > ('*/*/10'), which is not supported yet. You could open an enhancement
> > request.
> >
> > On Tue, Feb 3, 2015 at 5:27 PM, Andries Engelbrecht <
> > [email protected]> wrote:
> >
> >> Is it required for the directory pruning to work that a top-down filter
> >> of directories be applied?
> >>
> >> My current observation is that for a directory structure as listed
> >> below, the pruning only works if the full tree is provided. If only a
> >> lower-level directory is supplied in the filter condition, Drill only
> >> uses it as a filter.
> >>
> >> /2015
> >>     /01
> >>         /10
> >>         /11
> >>         /12
> >>         /13
> >>         /14
> >>
> >> select count(id) from `/foo` t where dir0='2015' and dir1='01' and
> >> dir2='10'
> >> Produces the correct pruning and query plan
> >> 01-02 Project(id=[$3]): rowcount = 3670316.0, cumulative cost = {1.1010948E7 rows, 1.4681284E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 28434
> >> 01-03 Project(dir0=[$0], dir1=[$3], dir2=[$2], id=[$1]): rowcount = 3670316.0, cumulative cost = {7340632.0 rows, 1.468128E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 28433
> >> 01-04 Scan(groupscan=[EasyGroupScan [selectionRoot=/foo, numFiles=24, columns=[`dir0`, `dir1`, `dir2`, `id`]
> >>
> >>
> >> However
> >> select count(id) from `/foo` t where dir2='10'
> >> Produces a full scan of all subdirectories and only applies a filter
> >> condition after the fact. Notice the difference in numFiles between the
> >> two plans, even though it lists columns in the base scan
> >> 01-04 Filter(condition=[=($0, '10')]): rowcount = 9423761.7, cumulative cost = {1.88475234E8 rows, 3.76950476E8 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 27470
> >> 01-05 Project(dir2=[$1], id=[$0]): rowcount = 6.2825078E7, cumulative cost = {1.25650156E8 rows, 1.25650164E8 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 27469
> >> 01-06 Scan(groupscan=[EasyGroupScan [selectionRoot=/foo, numFiles=405, columns=[`dir2`, `id`]
> >>
> >> Any thoughts?
> >>
> >> Thanks
> >>
> >> --Andries
