Re: Directory pruning with Drill

Andries Engelbrecht Wed, 04 Feb 2015 08:09:09 -0800

Jason,

Thanks for the clarification, I added these to the enhancement request.


—Andries


On Feb 3, 2015, at 9:47 PM, Jason Altekruse <[email protected]> wrote:

> Hao,
> 
> The dir columns are always added to the records coming out of a scan. The
> issue is with trying to avoid unneeded reads altogether. If you look at the
> query plan you should see that the scan is going to read all of the files
> and the filter against the directory column will be applied in a separate
> filter operation later. Currently we only support simple expressions,
> either equality or an in-list to specify partition filters that can be
> pushed into the scan operation.
> 
> -Jason
> 
> On Tue, Feb 3, 2015 at 8:59 PM, Hao Zhu <[email protected]> wrote:
> 
>> Strange, per my testing, we can do that:
>> 
>> 0: jdbc:drill:zk=n1a:5181,n2a:5181,n3a:5181> select * from `hao/2015` where
>> dir0=1;
>> +------------+------------+
>> |  columns   |    dir0    |
>> +------------+------------+
>> | ["1","2","3"] | 1          |
>> +------------+------------+
>> 1 row selected (0.098 seconds)
>> 0: jdbc:drill:zk=n1a:5181,n2a:5181,n3a:5181> select * from `hao/2015` where
>> dir0>1;
>> +------------+------------+
>> |  columns   |    dir0    |
>> +------------+------------+
>> | ["1","2","3"] | 3          |
>> | ["1","2","3"] | 2          |
>> +------------+------------+
>> 2 rows selected (0.18 seconds)
>> 
>> Thanks,
>> Hao
>> 
>> On Tue, Feb 3, 2015 at 8:27 PM, Tomer Shiran <[email protected]> wrote:
>> 
>>> The casting issue seems like a real bug. People want to do things like
>>> "dir0 > 2012"
>>> 
>>> On Tue, Feb 3, 2015 at 6:00 PM, Andries Engelbrecht <
>>> [email protected]> wrote:
>>> 
>>>> Thanks.
>>>> 
>>>> It will be good for users to understand the specifics of directory
>>> pruning.
>>>> 
>>>> As an additional note is is important to not cast the data typeof the
>> dir
>>>> filter and to provide a string (i.e. dir0=‘2015’) for pruning to work
>>>> properly.
>>>> With dir0=2015 the query to works, but the directories are no pruned
>>>> 
>>>> Similar if a view is created with columns for dir0, dir1, etc. the data
>>>> types should not be casted or converted, based on current observations.
>>>> 
>>>> It may be good to make it a bit friendlier for a better user
>> experience,
>>>> will file an enhancement request.
>>>> 
>>>> —Andries
>>>> 
>>>> 
>>>> On Feb 3, 2015, at 5:35 PM, Aman Sinha <[email protected]> wrote:
>>>> 
>>>>> Yes, that's the expected behavior for now.  Directory pruning where
>>> only
>>>>> subdirectory is specified is logically equivalent to wildcard
>> matching
>>> -
>>>>> '*/*/10'  which is not supported yet.  You could open an enhancement
>>>>> request.
>>>>> 
>>>>> On Tue, Feb 3, 2015 at 5:27 PM, Andries Engelbrecht <
>>>>> [email protected]> wrote:
>>>>> 
>>>>>> Is it required for the directory pruning to work that a top down
>>> filter
>>>> of
>>>>>> directories be applied?
>>>>>> 
>>>>>> My current observation is that for a directory structure as listed
>>>> below,
>>>>>> the pruning only works if the full tree is provided. If only a lower
>>>> level
>>>>>> directory is supplied in the filter condition Drill only uses it as
>> a
>>>>>> filter.
>>>>>> 
>>>>>> /2015
>>>>>>        /01
>>>>>>               /10
>>>>>>               /11
>>>>>>               /12
>>>>>>               /13
>>>>>>               /14
>>>>>> 
>>>>>> select count(id) from `/foo` t where dir0='2015' and dir1='01' and
>>>>>> dir2='10'
>>>>>> Produces the correct pruning and query plan
>>>>>> 01-02            Project(id=[$3]): rowcount = 3670316.0, cumulative
>>>> cost =
>>>>>> {1.1010948E7 rows, 1.4681284E7 cpu, 0.0 io, 0.0 network, 0.0
>> memory},
>>>> id =
>>>>>> 28434
>>>>>> 01-03              Project(dir0=[$0], dir1=[$3], dir2=[$2],
>> id=[$1]):
>>>>>> rowcount = 3670316.0, cumulative cost = {7340632.0 rows, 1.468128E7
>>> cpu,
>>>>>> 0.0 io, 0.0 network, 0.0 memory}, id = 28433
>>>>>> 01-04                Scan(groupscan=[EasyGroupScan
>>> [selectionRoot=/foo,
>>>>>> numFiles=24, columns=[`dir0`, `dir1`, `dir2`, `id`]
>>>>>> 
>>>>>> 
>>>>>> However
>>>>>> select count(id) from `/foo` t where dir2='10'
>>>>>> Produces full scan of all sub directories and only applies a filter
>>>>>> condition after the fact. Notice the numFiles between the 2, even
>>>> though it
>>>>>> lists columns in the base scan
>>>>>> 01-04                Filter(condition=[=($0, '10')]): rowcount =
>>>>>> 9423761.7, cumulative cost = {1.88475234E8 rows, 3.76950476E8 cpu,
>> 0.0
>>>> io,
>>>>>> 0.0 network, 0.0 memory}, id = 27470
>>>>>> 01-05                  Project(dir2=[$1], id=[$0]): rowcount =
>>>>>> 6.2825078E7, cumulative cost = {1.25650156E8 rows, 1.25650164E8 cpu,
>>> 0.0
>>>>>> io, 0.0 network, 0.0 memory}, id = 27469
>>>>>> 01-06                    Scan(groupscan=[EasyGroupScan
>>>>>> [selectionRoot=/foo, numFiles=405, columns=[`dir2`, `id`]
>>>>>> 
>>>>>> Any thoughts?
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>> —Andries
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>> 
>>

Re: Directory pruning with Drill

Reply via email to