Hey John, can you try an explain plan for both queries and see how much time it takes?
For example, for the first query you would run:

*explain plan for* select count(1) from `data/2016-02-03`;

It would also be helpful if you could share the query profiles for both queries.

Thanks

On Thu, Feb 4, 2016 at 8:15 AM, John Omernik <[email protected]> wrote:

> Hey all, I think I am seeing an issue related to
> https://issues.apache.org/jira/browse/DRILL-3759 but I want to describe it
> out here, see if it's really the case, and then determine what the blockers
> may be to resolution.
>
> I am using the MapR Developer Release 1.4, and I have a directory with
> subdirectories by date.
>
> data/2015-01-01
> data/2015-01-02
> data/2015-01-03
>
> These are stored as Parquet files. At this point each date averages about
> 1 GB of data, and has roughly 75 Parquet files in it.
>
> When I run
>
> select count(1) from `data/2016-02-03` it takes roughly 11 seconds.
>
> If I copy the 2016-02-03 directory to a new base (data_sum) and run
>
> select count(1) from `data_sum/2016-02-03` it runs in 0.874 seconds.
>
> Same data, same structure; the only difference is that the data_sum directory
> has only a few directories, and data has dates going back to Nov 2015. It
> seems like it is getting file names for all files in each directory prior
> to pruning, which seems to me to be adding a lot of latency to queries that
> doesn't need to be there (thus I think I am seeing 3759), but I wanted to
> confirm, and then I wanted to see how we can address this, in that the
> directory prune should be fast, and on large data sets it's just going to
> get worse and worse.
>
>
>
> John
>

--
Abdelhakim Deneche
Software Engineer
<http://www.mapr.com/>
