Jason, Thanks for the clarification, I added these to the enhancement request.
—Andries On Feb 3, 2015, at 9:47 PM, Jason Altekruse <[email protected]> wrote: > Hao, > > The dir columns are always added to the records coming out of a scan. The > issue is with trying to avoid unneeded reads altogether. If you look at the > query plan you should see that the scan is going to read all of the files > and the filter against the directory column will be applied in a separate > filter operation later. Currently we only support simple expressions, > either equality or an in-list to specify partition filters that can be > pushed into the scan operation. > > -Jason > > On Tue, Feb 3, 2015 at 8:59 PM, Hao Zhu <[email protected]> wrote: > >> Strange, per my testing, we can do that: >> >> 0: jdbc:drill:zk=n1a:5181,n2a:5181,n3a:5181> select * from `hao/2015` where >> dir0=1; >> +------------+------------+ >> | columns | dir0 | >> +------------+------------+ >> | ["1","2","3"] | 1 | >> +------------+------------+ >> 1 row selected (0.098 seconds) >> 0: jdbc:drill:zk=n1a:5181,n2a:5181,n3a:5181> select * from `hao/2015` where >> dir0>1; >> +------------+------------+ >> | columns | dir0 | >> +------------+------------+ >> | ["1","2","3"] | 3 | >> | ["1","2","3"] | 2 | >> +------------+------------+ >> 2 rows selected (0.18 seconds) >> >> Thanks, >> Hao >> >> On Tue, Feb 3, 2015 at 8:27 PM, Tomer Shiran <[email protected]> wrote: >> >>> The casting issue seems like a real bug. People want to do things like >>> "dir0 > 2012" >>> >>> On Tue, Feb 3, 2015 at 6:00 PM, Andries Engelbrecht < >>> [email protected]> wrote: >>> >>>> Thanks. >>>> >>>> It will be good for users to understand the specifics of directory >>> pruning. >>>> >>>> As an additional note is is important to not cast the data typeof the >> dir >>>> filter and to provide a string (i.e. dir0=‘2015’) for pruning to work >>>> properly. >>>> With dir0=2015 the query to works, but the directories are no pruned >>>> >>>> Similar if a view is created with columns for dir0, dir1, etc. the data >>>> types should not be casted or converted, based on current observations. >>>> >>>> It may be good to make it a bit friendlier for a better user >> experience, >>>> will file an enhancement request. >>>> >>>> —Andries >>>> >>>> >>>> On Feb 3, 2015, at 5:35 PM, Aman Sinha <[email protected]> wrote: >>>> >>>>> Yes, that's the expected behavior for now. Directory pruning where >>> only >>>>> subdirectory is specified is logically equivalent to wildcard >> matching >>> - >>>>> '*/*/10' which is not supported yet. You could open an enhancement >>>>> request. >>>>> >>>>> On Tue, Feb 3, 2015 at 5:27 PM, Andries Engelbrecht < >>>>> [email protected]> wrote: >>>>> >>>>>> Is it required for the directory pruning to work that a top down >>> filter >>>> of >>>>>> directories be applied? >>>>>> >>>>>> My current observation is that for a directory structure as listed >>>> below, >>>>>> the pruning only works if the full tree is provided. If only a lower >>>> level >>>>>> directory is supplied in the filter condition Drill only uses it as >> a >>>>>> filter. >>>>>> >>>>>> /2015 >>>>>> /01 >>>>>> /10 >>>>>> /11 >>>>>> /12 >>>>>> /13 >>>>>> /14 >>>>>> >>>>>> select count(id) from `/foo` t where dir0='2015' and dir1='01' and >>>>>> dir2='10' >>>>>> Produces the correct pruning and query plan >>>>>> 01-02 Project(id=[$3]): rowcount = 3670316.0, cumulative >>>> cost = >>>>>> {1.1010948E7 rows, 1.4681284E7 cpu, 0.0 io, 0.0 network, 0.0 >> memory}, >>>> id = >>>>>> 28434 >>>>>> 01-03 Project(dir0=[$0], dir1=[$3], dir2=[$2], >> id=[$1]): >>>>>> rowcount = 3670316.0, cumulative cost = {7340632.0 rows, 1.468128E7 >>> cpu, >>>>>> 0.0 io, 0.0 network, 0.0 memory}, id = 28433 >>>>>> 01-04 Scan(groupscan=[EasyGroupScan >>> [selectionRoot=/foo, >>>>>> numFiles=24, columns=[`dir0`, `dir1`, `dir2`, `id`] >>>>>> >>>>>> >>>>>> However >>>>>> select count(id) from `/foo` t where dir2='10' >>>>>> Produces full scan of all sub directories and only applies a filter >>>>>> condition after the fact. Notice the numFiles between the 2, even >>>> though it >>>>>> lists columns in the base scan >>>>>> 01-04 Filter(condition=[=($0, '10')]): rowcount = >>>>>> 9423761.7, cumulative cost = {1.88475234E8 rows, 3.76950476E8 cpu, >> 0.0 >>>> io, >>>>>> 0.0 network, 0.0 memory}, id = 27470 >>>>>> 01-05 Project(dir2=[$1], id=[$0]): rowcount = >>>>>> 6.2825078E7, cumulative cost = {1.25650156E8 rows, 1.25650164E8 cpu, >>> 0.0 >>>>>> io, 0.0 network, 0.0 memory}, id = 27469 >>>>>> 01-06 Scan(groupscan=[EasyGroupScan >>>>>> [selectionRoot=/foo, numFiles=405, columns=[`dir2`, `id`] >>>>>> >>>>>> Any thoughts? >>>>>> >>>>>> Thanks >>>>>> >>>>>> —Andries >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>> >>>> >>> >>
