Re: Directory pruning with Drill

Andries Engelbrecht Tue, 03 Feb 2015 18:02:09 -0800

Thanks.

It will be good for users to understand the specifics of directory pruning.


As an additional note, it is important not to cast the data type of the dir 
filter and to provide a string (i.e. dir0='2015') for pruning to work properly.
With dir0=2015 the query works, but the directories are not pruned.
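
One way to check this is to compare the two plans. This is a minimal sketch that reuses the `/foo` path from the quoted thread below; the value to compare is numFiles in the Scan line of each plan, and the outcome noted in the comments is only what was observed above, not a guarantee.

```sql
-- String literal, no cast: observed to prune the directories.
explain plan for
select count(id) from `/foo` t where dir0 = '2015';

-- Numeric literal: observed to return results, but without pruning.
explain plan for
select count(id) from `/foo` t where dir0 = 2015;
```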

Similarly, if a view is created with columns for dir0, dir1, etc., the data 
types should not be cast or converted, based on current observations.
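
A sketch of such a view, assuming a writable workspace is in use. The view name foo_by_dir and the aliases yr, mth and dy are hypothetical; `/foo` is the path from the quoted thread below. The dir columns are passed through as plain strings with no CAST, and the filter again uses string literals.

```sql
-- Hypothetical view: dir0/dir1/dir2 are exposed as-is, with no CAST or conversion.
create view foo_by_dir as
select dir0 as yr, dir1 as mth, dir2 as dy, id
from `/foo`;

-- Filter top-down on the aliased columns, using string literals.
select count(id) from foo_by_dir
where yr = '2015' and mth = '01' and dy = '10';
```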

It may be good to make this a bit friendlier for a better user experience; I 
will file an enhancement request.

—Andries
 

On Feb 3, 2015, at 5:35 PM, Aman Sinha <[email protected]> wrote:

> Yes, that's the expected behavior for now.  Directory pruning where only
> subdirectory is specified is logically equivalent to wildcard matching -
> '*/*/10'  which is not supported yet.  You could open an enhancement
> request.
> 
> On Tue, Feb 3, 2015 at 5:27 PM, Andries Engelbrecht <
> [email protected]> wrote:
> 
>> Is it required for the directory pruning to work that a top down filter of
>> directories be applied?
>> 
>> My current observation is that for a directory structure as listed below,
>> the pruning only works if the full tree is provided. If only a lower level
>> directory is supplied in the filter condition Drill only uses it as a
>> filter.
>> 
>> /2015
>>         /01
>>                /10
>>                /11
>>                /12
>>                /13
>>                /14
>> 
>> select count(id) from `/foo` t where dir0='2015' and dir1='01' and
>> dir2='10'
>> Produces the correct pruning and query plan
>> 01-02            Project(id=[$3]): rowcount = 3670316.0, cumulative cost =
>> {1.1010948E7 rows, 1.4681284E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id =
>> 28434
>> 01-03              Project(dir0=[$0], dir1=[$3], dir2=[$2], id=[$1]):
>> rowcount = 3670316.0, cumulative cost = {7340632.0 rows, 1.468128E7 cpu,
>> 0.0 io, 0.0 network, 0.0 memory}, id = 28433
>> 01-04                Scan(groupscan=[EasyGroupScan [selectionRoot=/foo,
>> numFiles=24, columns=[`dir0`, `dir1`, `dir2`, `id`]
>> 
>> 
>> However
>> select count(id) from `/foo` t where dir2='10'
>> Produces full scan of all sub directories and only applies a filter
>> condition after the fact. Notice the numFiles between the 2, even though it
>> lists columns in the base scan
>> 01-04                Filter(condition=[=($0, '10')]): rowcount =
>> 9423761.7, cumulative cost = {1.88475234E8 rows, 3.76950476E8 cpu, 0.0 io,
>> 0.0 network, 0.0 memory}, id = 27470
>> 01-05                  Project(dir2=[$1], id=[$0]): rowcount =
>> 6.2825078E7, cumulative cost = {1.25650156E8 rows, 1.25650164E8 cpu, 0.0
>> io, 0.0 network, 0.0 memory}, id = 27469
>> 01-06                    Scan(groupscan=[EasyGroupScan
>> [selectionRoot=/foo, numFiles=405, columns=[`dir2`, `id`]
>> 
>> Any thoughts?
>> 
>> Thanks
>> 
>> —Andries
>> 
