Query Planning and Directory Pruning

John Omernik Thu, 04 Feb 2016 08:16:54 -0800

Hey all, I think I am seeing an issue related to
https://issues.apache.org/jira/browse/DRILL-3759, but I want to describe it
here, see if that's really the case, and then determine what the blockers
to resolution may be.


I am using the MapR Developer Release 1.4, and I have a directory with
subdirectories by date:

data/2015-01-01
data/2015-01-02
data/2015-01-03

These are stored as Parquet files.  At this point each date averages about
1 GB of data and has roughly 75 Parquet files in it.

When I run

select count(1) from `data/2016-02-03` it takes roughly 11 seconds.

If I copy the 2016-02-03 directory to a new base (data_sum) and run

select count(1) from `data_sum/2016-02-03` it runs in 0.874 seconds.

Same data, same structure; the only difference is that the data_sum
directory has only a few subdirectories, while data has dates going back to
Nov 2015.  It seems like Drill is getting the file names for all files in
every subdirectory prior to pruning, which seems to me to be adding a lot of
latency to queries that doesn't need to be there (thus I think I am seeing
3759).  I wanted to confirm that, and then see how we can address it: the
directory prune should be fast, and on large data sets it's just going to
get worse and worse.
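
As a toy illustration of the suspected behavior, the sketch below compares
two ways of resolving a single date directory: listing every subdirectory
and filtering afterwards, versus going straight to the requested one.  This
is NOT Drill's planner code; the function names (build_tree,
list_then_prune, prune_then_list) and the file naming are hypothetical, and
it only counts subdirectory listings on a local temporary tree.

```python
import os
import tempfile


def build_tree(base, dates, files_per_dir):
    """Create base/<date>/part-<i>.parquet as empty placeholder files."""
    for d in dates:
        path = os.path.join(base, d)
        os.makedirs(path)
        for i in range(files_per_dir):
            open(os.path.join(path, f"part-{i}.parquet"), "w").close()


def list_then_prune(base, wanted):
    """Suspected behavior: list every subdirectory, then keep one of them."""
    listings = 0
    kept = []
    for sub in sorted(os.listdir(base)):
        names = os.listdir(os.path.join(base, sub))  # one listing per date
        listings += 1
        if sub == wanted:
            kept.extend(os.path.join(sub, n) for n in names)
    return kept, listings


def prune_then_list(base, wanted):
    """Desired behavior: list only the subdirectory named in the query."""
    names = os.listdir(os.path.join(base, wanted))
    return [os.path.join(wanted, n) for n in names], 1


if __name__ == "__main__":
    # Hypothetical layout: 90 date directories with 3 files each.
    dates = [f"2015-11-{day:02d}" for day in range(1, 31)]
    dates += [f"2015-12-{day:02d}" for day in range(1, 31)]
    dates += [f"2016-01-{day:02d}" for day in range(1, 31)]
    with tempfile.TemporaryDirectory() as base:
        build_tree(base, dates, 3)
        _, n_all = list_then_prune(base, "2016-01-15")
        _, n_one = prune_then_list(base, "2016-01-15")
        print("subdirectory listings, list-then-prune:", n_all)
        print("subdirectory listings, prune-then-list:", n_one)
```

Both functions return the same files; only the number of listings differs,
and in the first case it grows with the number of date directories under
the base rather than with the data actually queried.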



John
