Re: Directory pruning with Drill

Jason Altekruse Tue, 03 Feb 2015 21:49:48 -0800

Hao,

The dir columns are always added to the records coming out of a scan. The
issue is with trying to avoid unneeded reads altogether. If you look at the
query plan you should see that the scan is going to read all of the files
and the filter against the directory column will be applied in a separate
filter operation later. Currently we only support simple expressions,
either equality or an in-list to specify partition filters that can be
pushed into the scan operation.


-Jason

On Tue, Feb 3, 2015 at 8:59 PM, Hao Zhu <[email protected]> wrote:

> Strange, per my testing, we can do that:
>
> 0: jdbc:drill:zk=n1a:5181,n2a:5181,n3a:5181> select * from `hao/2015` where
> dir0=1;
> +------------+------------+
> |  columns   |    dir0    |
> +------------+------------+
> | ["1","2","3"] | 1          |
> +------------+------------+
> 1 row selected (0.098 seconds)
> 0: jdbc:drill:zk=n1a:5181,n2a:5181,n3a:5181> select * from `hao/2015` where
> dir0>1;
> +------------+------------+
> |  columns   |    dir0    |
> +------------+------------+
> | ["1","2","3"] | 3          |
> | ["1","2","3"] | 2          |
> +------------+------------+
> 2 rows selected (0.18 seconds)
>
> Thanks,
> Hao
>
> On Tue, Feb 3, 2015 at 8:27 PM, Tomer Shiran <[email protected]> wrote:
>
> > The casting issue seems like a real bug. People want to do things like
> > "dir0 > 2012"
> >
> > On Tue, Feb 3, 2015 at 6:00 PM, Andries Engelbrecht <
> > [email protected]> wrote:
> >
> > > Thanks.
> > >
> > > It will be good for users to understand the specifics of directory
> > pruning.
> > >
> > > As an additional note is is important to not cast the data typeof the
> dir
> > > filter and to provide a string (i.e. dir0=‘2015’) for pruning to work
> > > properly.
> > > With dir0=2015 the query to works, but the directories are no pruned
> > >
> > > Similar if a view is created with columns for dir0, dir1, etc. the data
> > > types should not be casted or converted, based on current observations.
> > >
> > > It may be good to make it a bit friendlier for a better user
> experience,
> > > will file an enhancement request.
> > >
> > > —Andries
> > >
> > >
> > > On Feb 3, 2015, at 5:35 PM, Aman Sinha <[email protected]> wrote:
> > >
> > > > Yes, that's the expected behavior for now.  Directory pruning where
> > only
> > > > subdirectory is specified is logically equivalent to wildcard
> matching
> > -
> > > > '*/*/10'  which is not supported yet.  You could open an enhancement
> > > > request.
> > > >
> > > > On Tue, Feb 3, 2015 at 5:27 PM, Andries Engelbrecht <
> > > > [email protected]> wrote:
> > > >
> > > >> Is it required for the directory pruning to work that a top down
> > filter
> > > of
> > > >> directories be applied?
> > > >>
> > > >> My current observation is that for a directory structure as listed
> > > below,
> > > >> the pruning only works if the full tree is provided. If only a lower
> > > level
> > > >> directory is supplied in the filter condition Drill only uses it as
> a
> > > >> filter.
> > > >>
> > > >> /2015
> > > >>         /01
> > > >>                /10
> > > >>                /11
> > > >>                /12
> > > >>                /13
> > > >>                /14
> > > >>
> > > >> select count(id) from `/foo` t where dir0='2015' and dir1='01' and
> > > >> dir2='10'
> > > >> Produces the correct pruning and query plan
> > > >> 01-02            Project(id=[$3]): rowcount = 3670316.0, cumulative
> > > cost =
> > > >> {1.1010948E7 rows, 1.4681284E7 cpu, 0.0 io, 0.0 network, 0.0
> memory},
> > > id =
> > > >> 28434
> > > >> 01-03              Project(dir0=[$0], dir1=[$3], dir2=[$2],
> id=[$1]):
> > > >> rowcount = 3670316.0, cumulative cost = {7340632.0 rows, 1.468128E7
> > cpu,
> > > >> 0.0 io, 0.0 network, 0.0 memory}, id = 28433
> > > >> 01-04                Scan(groupscan=[EasyGroupScan
> > [selectionRoot=/foo,
> > > >> numFiles=24, columns=[`dir0`, `dir1`, `dir2`, `id`]
> > > >>
> > > >>
> > > >> However
> > > >> select count(id) from `/foo` t where dir2='10'
> > > >> Produces full scan of all sub directories and only applies a filter
> > > >> condition after the fact. Notice the numFiles between the 2, even
> > > though it
> > > >> lists columns in the base scan
> > > >> 01-04                Filter(condition=[=($0, '10')]): rowcount =
> > > >> 9423761.7, cumulative cost = {1.88475234E8 rows, 3.76950476E8 cpu,
> 0.0
> > > io,
> > > >> 0.0 network, 0.0 memory}, id = 27470
> > > >> 01-05                  Project(dir2=[$1], id=[$0]): rowcount =
> > > >> 6.2825078E7, cumulative cost = {1.25650156E8 rows, 1.25650164E8 cpu,
> > 0.0
> > > >> io, 0.0 network, 0.0 memory}, id = 27469
> > > >> 01-06                    Scan(groupscan=[EasyGroupScan
> > > >> [selectionRoot=/foo, numFiles=405, columns=[`dir2`, `id`]
> > > >>
> > > >> Any thoughts?
> > > >>
> > > >> Thanks
> > > >>
> > > >> —Andries
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > >
> > >
> >
>

Re: Directory pruning with Drill

Reply via email to