Thanks, I thought we were saying the SQL "dir0>1;" does not work functionally. Not I got that, it does not do directory pruning.
On Tue, Feb 3, 2015 at 9:47 PM, Jason Altekruse <[email protected]> wrote: > Hao, > > The dir columns are always added to the records coming out of a scan. The > issue is with trying to avoid unneeded reads altogether. If you look at the > query plan you should see that the scan is going to read all of the files > and the filter against the directory column will be applied in a separate > filter operation later. Currently we only support simple expressions, > either equality or an in-list to specify partition filters that can be > pushed into the scan operation. > > -Jason > > On Tue, Feb 3, 2015 at 8:59 PM, Hao Zhu <[email protected]> wrote: > > > Strange, per my testing, we can do that: > > > > 0: jdbc:drill:zk=n1a:5181,n2a:5181,n3a:5181> select * from `hao/2015` > where > > dir0=1; > > +------------+------------+ > > | columns | dir0 | > > +------------+------------+ > > | ["1","2","3"] | 1 | > > +------------+------------+ > > 1 row selected (0.098 seconds) > > 0: jdbc:drill:zk=n1a:5181,n2a:5181,n3a:5181> select * from `hao/2015` > where > > dir0>1; > > +------------+------------+ > > | columns | dir0 | > > +------------+------------+ > > | ["1","2","3"] | 3 | > > | ["1","2","3"] | 2 | > > +------------+------------+ > > 2 rows selected (0.18 seconds) > > > > Thanks, > > Hao > > > > On Tue, Feb 3, 2015 at 8:27 PM, Tomer Shiran <[email protected]> wrote: > > > > > The casting issue seems like a real bug. People want to do things like > > > "dir0 > 2012" > > > > > > On Tue, Feb 3, 2015 at 6:00 PM, Andries Engelbrecht < > > > [email protected]> wrote: > > > > > > > Thanks. > > > > > > > > It will be good for users to understand the specifics of directory > > > pruning. > > > > > > > > As an additional note is is important to not cast the data typeof the > > dir > > > > filter and to provide a string (i.e. dir0=‘2015’) for pruning to work > > > > properly. > > > > With dir0=2015 the query to works, but the directories are no pruned > > > > > > > > Similar if a view is created with columns for dir0, dir1, etc. the > data > > > > types should not be casted or converted, based on current > observations. > > > > > > > > It may be good to make it a bit friendlier for a better user > > experience, > > > > will file an enhancement request. > > > > > > > > —Andries > > > > > > > > > > > > On Feb 3, 2015, at 5:35 PM, Aman Sinha <[email protected]> wrote: > > > > > > > > > Yes, that's the expected behavior for now. Directory pruning where > > > only > > > > > subdirectory is specified is logically equivalent to wildcard > > matching > > > - > > > > > '*/*/10' which is not supported yet. You could open an > enhancement > > > > > request. > > > > > > > > > > On Tue, Feb 3, 2015 at 5:27 PM, Andries Engelbrecht < > > > > > [email protected]> wrote: > > > > > > > > > >> Is it required for the directory pruning to work that a top down > > > filter > > > > of > > > > >> directories be applied? > > > > >> > > > > >> My current observation is that for a directory structure as listed > > > > below, > > > > >> the pruning only works if the full tree is provided. If only a > lower > > > > level > > > > >> directory is supplied in the filter condition Drill only uses it > as > > a > > > > >> filter. > > > > >> > > > > >> /2015 > > > > >> /01 > > > > >> /10 > > > > >> /11 > > > > >> /12 > > > > >> /13 > > > > >> /14 > > > > >> > > > > >> select count(id) from `/foo` t where dir0='2015' and dir1='01' and > > > > >> dir2='10' > > > > >> Produces the correct pruning and query plan > > > > >> 01-02 Project(id=[$3]): rowcount = 3670316.0, > cumulative > > > > cost = > > > > >> {1.1010948E7 rows, 1.4681284E7 cpu, 0.0 io, 0.0 network, 0.0 > > memory}, > > > > id = > > > > >> 28434 > > > > >> 01-03 Project(dir0=[$0], dir1=[$3], dir2=[$2], > > id=[$1]): > > > > >> rowcount = 3670316.0, cumulative cost = {7340632.0 rows, > 1.468128E7 > > > cpu, > > > > >> 0.0 io, 0.0 network, 0.0 memory}, id = 28433 > > > > >> 01-04 Scan(groupscan=[EasyGroupScan > > > [selectionRoot=/foo, > > > > >> numFiles=24, columns=[`dir0`, `dir1`, `dir2`, `id`] > > > > >> > > > > >> > > > > >> However > > > > >> select count(id) from `/foo` t where dir2='10' > > > > >> Produces full scan of all sub directories and only applies a > filter > > > > >> condition after the fact. Notice the numFiles between the 2, even > > > > though it > > > > >> lists columns in the base scan > > > > >> 01-04 Filter(condition=[=($0, '10')]): rowcount = > > > > >> 9423761.7, cumulative cost = {1.88475234E8 rows, 3.76950476E8 cpu, > > 0.0 > > > > io, > > > > >> 0.0 network, 0.0 memory}, id = 27470 > > > > >> 01-05 Project(dir2=[$1], id=[$0]): rowcount = > > > > >> 6.2825078E7, cumulative cost = {1.25650156E8 rows, 1.25650164E8 > cpu, > > > 0.0 > > > > >> io, 0.0 network, 0.0 memory}, id = 27469 > > > > >> 01-06 Scan(groupscan=[EasyGroupScan > > > > >> [selectionRoot=/foo, numFiles=405, columns=[`dir2`, `id`] > > > > >> > > > > >> Any thoughts? > > > > >> > > > > >> Thanks > > > > >> > > > > >> —Andries > > > > >> > > > > >> > > > > >> > > > > >> > > > > >> > > > > >> > > > > >> > > > > >> > > > > >> > > > > >> > > > > > > > > > > > > > >
