Re: Directory pruning with Drill

Hao Zhu Tue, 03 Feb 2015 21:55:45 -0800

Thanks, I thought we were saying the SQL "dir0>1;" does not work
functionally.
Not I got that, it does not do directory pruning.


On Tue, Feb 3, 2015 at 9:47 PM, Jason Altekruse <[email protected]>
wrote:

> Hao,
>
> The dir columns are always added to the records coming out of a scan. The
> issue is with trying to avoid unneeded reads altogether. If you look at the
> query plan you should see that the scan is going to read all of the files
> and the filter against the directory column will be applied in a separate
> filter operation later. Currently we only support simple expressions,
> either equality or an in-list to specify partition filters that can be
> pushed into the scan operation.
>
> -Jason
>
> On Tue, Feb 3, 2015 at 8:59 PM, Hao Zhu <[email protected]> wrote:
>
> > Strange, per my testing, we can do that:
> >
> > 0: jdbc:drill:zk=n1a:5181,n2a:5181,n3a:5181> select * from `hao/2015`
> where
> > dir0=1;
> > +------------+------------+
> > |  columns   |    dir0    |
> > +------------+------------+
> > | ["1","2","3"] | 1          |
> > +------------+------------+
> > 1 row selected (0.098 seconds)
> > 0: jdbc:drill:zk=n1a:5181,n2a:5181,n3a:5181> select * from `hao/2015`
> where
> > dir0>1;
> > +------------+------------+
> > |  columns   |    dir0    |
> > +------------+------------+
> > | ["1","2","3"] | 3          |
> > | ["1","2","3"] | 2          |
> > +------------+------------+
> > 2 rows selected (0.18 seconds)
> >
> > Thanks,
> > Hao
> >
> > On Tue, Feb 3, 2015 at 8:27 PM, Tomer Shiran <[email protected]> wrote:
> >
> > > The casting issue seems like a real bug. People want to do things like
> > > "dir0 > 2012"
> > >
> > > On Tue, Feb 3, 2015 at 6:00 PM, Andries Engelbrecht <
> > > [email protected]> wrote:
> > >
> > > > Thanks.
> > > >
> > > > It will be good for users to understand the specifics of directory
> > > pruning.
> > > >
> > > > As an additional note is is important to not cast the data typeof the
> > dir
> > > > filter and to provide a string (i.e. dir0=‘2015’) for pruning to work
> > > > properly.
> > > > With dir0=2015 the query to works, but the directories are no pruned
> > > >
> > > > Similar if a view is created with columns for dir0, dir1, etc. the
> data
> > > > types should not be casted or converted, based on current
> observations.
> > > >
> > > > It may be good to make it a bit friendlier for a better user
> > experience,
> > > > will file an enhancement request.
> > > >
> > > > —Andries
> > > >
> > > >
> > > > On Feb 3, 2015, at 5:35 PM, Aman Sinha <[email protected]> wrote:
> > > >
> > > > > Yes, that's the expected behavior for now.  Directory pruning where
> > > only
> > > > > subdirectory is specified is logically equivalent to wildcard
> > matching
> > > -
> > > > > '*/*/10'  which is not supported yet.  You could open an
> enhancement
> > > > > request.
> > > > >
> > > > > On Tue, Feb 3, 2015 at 5:27 PM, Andries Engelbrecht <
> > > > > [email protected]> wrote:
> > > > >
> > > > >> Is it required for the directory pruning to work that a top down
> > > filter
> > > > of
> > > > >> directories be applied?
> > > > >>
> > > > >> My current observation is that for a directory structure as listed
> > > > below,
> > > > >> the pruning only works if the full tree is provided. If only a
> lower
> > > > level
> > > > >> directory is supplied in the filter condition Drill only uses it
> as
> > a
> > > > >> filter.
> > > > >>
> > > > >> /2015
> > > > >>         /01
> > > > >>                /10
> > > > >>                /11
> > > > >>                /12
> > > > >>                /13
> > > > >>                /14
> > > > >>
> > > > >> select count(id) from `/foo` t where dir0='2015' and dir1='01' and
> > > > >> dir2='10'
> > > > >> Produces the correct pruning and query plan
> > > > >> 01-02            Project(id=[$3]): rowcount = 3670316.0,
> cumulative
> > > > cost =
> > > > >> {1.1010948E7 rows, 1.4681284E7 cpu, 0.0 io, 0.0 network, 0.0
> > memory},
> > > > id =
> > > > >> 28434
> > > > >> 01-03              Project(dir0=[$0], dir1=[$3], dir2=[$2],
> > id=[$1]):
> > > > >> rowcount = 3670316.0, cumulative cost = {7340632.0 rows,
> 1.468128E7
> > > cpu,
> > > > >> 0.0 io, 0.0 network, 0.0 memory}, id = 28433
> > > > >> 01-04                Scan(groupscan=[EasyGroupScan
> > > [selectionRoot=/foo,
> > > > >> numFiles=24, columns=[`dir0`, `dir1`, `dir2`, `id`]
> > > > >>
> > > > >>
> > > > >> However
> > > > >> select count(id) from `/foo` t where dir2='10'
> > > > >> Produces full scan of all sub directories and only applies a
> filter
> > > > >> condition after the fact. Notice the numFiles between the 2, even
> > > > though it
> > > > >> lists columns in the base scan
> > > > >> 01-04                Filter(condition=[=($0, '10')]): rowcount =
> > > > >> 9423761.7, cumulative cost = {1.88475234E8 rows, 3.76950476E8 cpu,
> > 0.0
> > > > io,
> > > > >> 0.0 network, 0.0 memory}, id = 27470
> > > > >> 01-05                  Project(dir2=[$1], id=[$0]): rowcount =
> > > > >> 6.2825078E7, cumulative cost = {1.25650156E8 rows, 1.25650164E8
> cpu,
> > > 0.0
> > > > >> io, 0.0 network, 0.0 memory}, id = 27469
> > > > >> 01-06                    Scan(groupscan=[EasyGroupScan
> > > > >> [selectionRoot=/foo, numFiles=405, columns=[`dir2`, `id`]
> > > > >>
> > > > >> Any thoughts?
> > > > >>
> > > > >> Thanks
> > > > >>
> > > > >> —Andries
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > >
> > > >
> > >
> >
>

Re: Directory pruning with Drill

Reply via email to