Re: Directory pruning with Drill

Andries Engelbrecht Wed, 04 Feb 2015 09:03:02 -0800

Aman,

In my case I can clearly see that with numerics for the directory name pruning 
is not performed when quotes were not applied.
This is for JSON, which should not matter.


select count(id) from tweet_base where dir_year=2015 and dir_month=01 and 
dir_day=10

01-04                Filter(condition=[AND(=(CAST($1):ANY NOT NULL, 2015), 
=(CAST($2):ANY NOT NULL, 1), =(CAST($3):ANY NOT NULL, 10))]): rowcount = 
212034.63825, cumulative cost = {1.88475234E8 rows, 1.005201264E9 cpu, 0.0 io, 
0.0 network, 0.0 memory}, id = 34910
01-05                  Project(id=[$2], dir0=[$3], dir1=[$1], dir2=[$0]): 
rowcount = 6.2825078E7, cumulative cost = {1.25650156E8 rows, 2.51300328E8 cpu, 
0.0 io, 0.0 network, 0.0 memory}, id = 34909
01-06                    Scan(groupscan=[EasyGroupScan 
[selectionRoot=/user/aengelbrecht/twitter/nfl, numFiles=405, columns=[`id`, 
`dir0`, `dir1`, `dir2`],


Where
select count(id) from tweet_base where dir_year='2015' and dir_month='01' and 
dir_day='10'


01-02            Project(id=[CAST(CAST($0):BIGINT):BIGINT NOT NULL]): rowcount 
= 3670316.0, cumulative cost = {1.1010948E7 rows, 1.4681284E7 cpu, 0.0 io, 0.0 
network, 0.0 memory}, id = 32130
01-03              Project(id=[$2], dir0=[$3], dir1=[$1], dir2=[$0]): rowcount 
= 3670316.0, cumulative cost = {7340632.0 rows, 1.468128E7 cpu, 0.0 io, 0.0 
network, 0.0 memory}, id = 32129
01-04                Scan(groupscan=[EasyGroupScan 
[selectionRoot=/user/aengelbrecht/twitter/nfl, numFiles=24, columns=[`id`, 
`dir0`, `dir1`, `dir2`],

I’m using Drill 0.7

Thanks
—Andries





On Feb 4, 2015, at 8:53 AM, Aman Sinha <[email protected]> wrote:

> Andries,  the dir0=2015  with or without quotes should do the pruning.
> However, if the value is not numeric, such as 'Qtr1' then you need quotes.
> Here's my query which shows the pruning is being done:
> 
> (there are 3 directories: 1994, 1995, 1996 and query selects one of them
> and Q1 within that directory):
> 
> 0: jdbc:drill:zk=local> explain plan for select * from
> dfs.`/Users/asinha/data/multilevel/parquet` where dir0=1995 and dir1='Q1';
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      Project(*=[$0])
> 00-02        Project(*=[$0])
> 00-03          Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath
> [path=/Users/asinha/data/multilevel/parquet/1995/Q1/orders_95_q1.parquet]],
> selectionRoot=/Users/asinha/data/multilevel/parquet, numFiles=1,
> columns=[`*`]]])
> 
> On Wed, Feb 4, 2015 at 8:06 AM, Andries Engelbrecht <
> [email protected]> wrote:
> 
>> Jason,
>> 
>> Thanks for the clarification, I added these to the enhancement request.
>> 
>> —Andries
>> 
>> 
>> On Feb 3, 2015, at 9:47 PM, Jason Altekruse <[email protected]>
>> wrote:
>> 
>>> Hao,
>>> 
>>> The dir columns are always added to the records coming out of a scan. The
>>> issue is with trying to avoid unneeded reads altogether. If you look at
>> the
>>> query plan you should see that the scan is going to read all of the files
>>> and the filter against the directory column will be applied in a separate
>>> filter operation later. Currently we only support simple expressions,
>>> either equality or an in-list to specify partition filters that can be
>>> pushed into the scan operation.
>>> 
>>> -Jason
>>> 
>>> On Tue, Feb 3, 2015 at 8:59 PM, Hao Zhu <[email protected]> wrote:
>>> 
>>>> Strange, per my testing, we can do that:
>>>> 
>>>> 0: jdbc:drill:zk=n1a:5181,n2a:5181,n3a:5181> select * from `hao/2015`
>> where
>>>> dir0=1;
>>>> +------------+------------+
>>>> |  columns   |    dir0    |
>>>> +------------+------------+
>>>> | ["1","2","3"] | 1          |
>>>> +------------+------------+
>>>> 1 row selected (0.098 seconds)
>>>> 0: jdbc:drill:zk=n1a:5181,n2a:5181,n3a:5181> select * from `hao/2015`
>> where
>>>> dir0>1;
>>>> +------------+------------+
>>>> |  columns   |    dir0    |
>>>> +------------+------------+
>>>> | ["1","2","3"] | 3          |
>>>> | ["1","2","3"] | 2          |
>>>> +------------+------------+
>>>> 2 rows selected (0.18 seconds)
>>>> 
>>>> Thanks,
>>>> Hao
>>>> 
>>>> On Tue, Feb 3, 2015 at 8:27 PM, Tomer Shiran <[email protected]> wrote:
>>>> 
>>>>> The casting issue seems like a real bug. People want to do things like
>>>>> "dir0 > 2012"
>>>>> 
>>>>> On Tue, Feb 3, 2015 at 6:00 PM, Andries Engelbrecht <
>>>>> [email protected]> wrote:
>>>>> 
>>>>>> Thanks.
>>>>>> 
>>>>>> It will be good for users to understand the specifics of directory
>>>>> pruning.
>>>>>> 
>>>>>> As an additional note is is important to not cast the data typeof the
>>>> dir
>>>>>> filter and to provide a string (i.e. dir0=‘2015’) for pruning to work
>>>>>> properly.
>>>>>> With dir0=2015 the query to works, but the directories are no pruned
>>>>>> 
>>>>>> Similar if a view is created with columns for dir0, dir1, etc. the
>> data
>>>>>> types should not be casted or converted, based on current
>> observations.
>>>>>> 
>>>>>> It may be good to make it a bit friendlier for a better user
>>>> experience,
>>>>>> will file an enhancement request.
>>>>>> 
>>>>>> —Andries
>>>>>> 
>>>>>> 
>>>>>> On Feb 3, 2015, at 5:35 PM, Aman Sinha <[email protected]> wrote:
>>>>>> 
>>>>>>> Yes, that's the expected behavior for now.  Directory pruning where
>>>>> only
>>>>>>> subdirectory is specified is logically equivalent to wildcard
>>>> matching
>>>>> -
>>>>>>> '*/*/10'  which is not supported yet.  You could open an enhancement
>>>>>>> request.
>>>>>>> 
>>>>>>> On Tue, Feb 3, 2015 at 5:27 PM, Andries Engelbrecht <
>>>>>>> [email protected]> wrote:
>>>>>>> 
>>>>>>>> Is it required for the directory pruning to work that a top down
>>>>> filter
>>>>>> of
>>>>>>>> directories be applied?
>>>>>>>> 
>>>>>>>> My current observation is that for a directory structure as listed
>>>>>> below,
>>>>>>>> the pruning only works if the full tree is provided. If only a lower
>>>>>> level
>>>>>>>> directory is supplied in the filter condition Drill only uses it as
>>>> a
>>>>>>>> filter.
>>>>>>>> 
>>>>>>>> /2015
>>>>>>>>       /01
>>>>>>>>              /10
>>>>>>>>              /11
>>>>>>>>              /12
>>>>>>>>              /13
>>>>>>>>              /14
>>>>>>>> 
>>>>>>>> select count(id) from `/foo` t where dir0='2015' and dir1='01' and
>>>>>>>> dir2='10'
>>>>>>>> Produces the correct pruning and query plan
>>>>>>>> 01-02            Project(id=[$3]): rowcount = 3670316.0, cumulative
>>>>>> cost =
>>>>>>>> {1.1010948E7 rows, 1.4681284E7 cpu, 0.0 io, 0.0 network, 0.0
>>>> memory},
>>>>>> id =
>>>>>>>> 28434
>>>>>>>> 01-03              Project(dir0=[$0], dir1=[$3], dir2=[$2],
>>>> id=[$1]):
>>>>>>>> rowcount = 3670316.0, cumulative cost = {7340632.0 rows, 1.468128E7
>>>>> cpu,
>>>>>>>> 0.0 io, 0.0 network, 0.0 memory}, id = 28433
>>>>>>>> 01-04                Scan(groupscan=[EasyGroupScan
>>>>> [selectionRoot=/foo,
>>>>>>>> numFiles=24, columns=[`dir0`, `dir1`, `dir2`, `id`]
>>>>>>>> 
>>>>>>>> 
>>>>>>>> However
>>>>>>>> select count(id) from `/foo` t where dir2='10'
>>>>>>>> Produces full scan of all sub directories and only applies a filter
>>>>>>>> condition after the fact. Notice the numFiles between the 2, even
>>>>>> though it
>>>>>>>> lists columns in the base scan
>>>>>>>> 01-04                Filter(condition=[=($0, '10')]): rowcount =
>>>>>>>> 9423761.7, cumulative cost = {1.88475234E8 rows, 3.76950476E8 cpu,
>>>> 0.0
>>>>>> io,
>>>>>>>> 0.0 network, 0.0 memory}, id = 27470
>>>>>>>> 01-05                  Project(dir2=[$1], id=[$0]): rowcount =
>>>>>>>> 6.2825078E7, cumulative cost = {1.25650156E8 rows, 1.25650164E8 cpu,
>>>>> 0.0
>>>>>>>> io, 0.0 network, 0.0 memory}, id = 27469
>>>>>>>> 01-06                    Scan(groupscan=[EasyGroupScan
>>>>>>>> [selectionRoot=/foo, numFiles=405, columns=[`dir2`, `id`]
>>>>>>>> 
>>>>>>>> Any thoughts?
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> 
>>>>>>>> —Andries
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> 
>>

Re: Directory pruning with Drill

Reply via email to