The casting issue seems like a real bug. People want to do things like
"dir0 > 2012"
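
To make that concrete, here is a rough sketch of the two forms against the `/foo` layout quoted below. The range predicate is only an illustration. The thread confirms pruning behavior for equality on a string literal, so whether the second query actually prunes is an assumption, not something verified here.

```sql
-- Numeric literal: by analogy with dir0=2015 in the report below,
-- the query is expected to run, but the directories may not be pruned.
select count(id) from `/foo` t where dir0 > 2012;

-- String literal: the form reported to prune for equality filters.
-- Whether a range comparison prunes the same way is an assumption
-- and has not been checked.
select count(id) from `/foo` t where dir0 > '2012';
```

Comparing numFiles in the Scan line of each plan, as done in the report below, would show whether pruning took place.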

On Tue, Feb 3, 2015 at 6:00 PM, Andries Engelbrecht <
[email protected]> wrote:

> Thanks.
>
> It will be good for users to understand the specifics of directory pruning.
>
> As an additional note, it is important not to cast the data type of the dir
> filter and to provide a string (i.e. dir0='2015') for pruning to work
> properly.
> With dir0=2015 the query works, but the directories are not pruned.
>
> Similarly, if a view is created with columns for dir0, dir1, etc., the data
> types should not be cast or converted, based on current observations.
>
> It may be good to make it a bit friendlier for a better user experience;
> I will file an enhancement request.
>
> —Andries
>
>
> On Feb 3, 2015, at 5:35 PM, Aman Sinha <[email protected]> wrote:
>
> > Yes, that's the expected behavior for now.  Directory pruning where only a
> > subdirectory is specified is logically equivalent to wildcard matching -
> > '*/*/10' - which is not supported yet.  You could open an enhancement
> > request.
> >
> > On Tue, Feb 3, 2015 at 5:27 PM, Andries Engelbrecht <
> > [email protected]> wrote:
> >
> >> Is it required for the directory pruning to work that a top down filter of
> >> directories be applied?
> >>
> >> My current observation is that for a directory structure as listed below,
> >> the pruning only works if the full tree is provided. If only a lower level
> >> directory is supplied in the filter condition Drill only uses it as a
> >> filter.
> >>
> >> /2015
> >>         /01
> >>                /10
> >>                /11
> >>                /12
> >>                /13
> >>                /14
> >>
> >> select count(id) from `/foo` t where dir0='2015' and dir1='01' and
> >> dir2='10'
> >> Produces the correct pruning and query plan
> >> 01-02            Project(id=[$3]): rowcount = 3670316.0, cumulative cost =
> >> {1.1010948E7 rows, 1.4681284E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id =
> >> 28434
> >> 01-03              Project(dir0=[$0], dir1=[$3], dir2=[$2], id=[$1]):
> >> rowcount = 3670316.0, cumulative cost = {7340632.0 rows, 1.468128E7 cpu,
> >> 0.0 io, 0.0 network, 0.0 memory}, id = 28433
> >> 01-04                Scan(groupscan=[EasyGroupScan [selectionRoot=/foo,
> >> numFiles=24, columns=[`dir0`, `dir1`, `dir2`, `id`]
> >>
> >>
> >> However
> >> select count(id) from `/foo` t where dir2='10'
> >> Produces a full scan of all subdirectories and only applies the filter
> >> condition after the fact. Notice the difference in numFiles between the
> >> two plans, even though it lists columns in the base scan
> >> 01-04                Filter(condition=[=($0, '10')]): rowcount =
> >> 9423761.7, cumulative cost = {1.88475234E8 rows, 3.76950476E8 cpu, 0.0 io,
> >> 0.0 network, 0.0 memory}, id = 27470
> >> 01-05                  Project(dir2=[$1], id=[$0]): rowcount =
> >> 6.2825078E7, cumulative cost = {1.25650156E8 rows, 1.25650164E8 cpu, 0.0
> >> io, 0.0 network, 0.0 memory}, id = 27469
> >> 01-06                    Scan(groupscan=[EasyGroupScan
> >> [selectionRoot=/foo, numFiles=405, columns=[`dir2`, `id`]
> >>
> >> Any thoughts?
> >>
> >> Thanks
> >>
> >> —Andries
> >>
