I am sorry that wasn't clearer, I am not sure why I used those examples, but if have a table
table table/2016-02-04 table/2016-02-03 ... table/2015-07-01 (so a subdirectory for each day for half a year) then I make a new empty directory of table1 table1/2016-02-04 table1/2016-02-03 (only two subdirectories that I copied EXACTLY from table) select count(1) from table1 where dir0 = '2016-02-04' returns the count in 0.313 seconds select count(1) from table where dir0 = '2016-02-04' returns the same exact count in 9.879 seconds So yes, this applies to the dir pruning issue. Thanks! John On Tue, Feb 9, 2016 at 11:02 AM, Aman Sinha <[email protected]> wrote: > At a glance, John's query does not have a WHERE clause..it is querying the > subdirectory directly in the FROM clause..in this case Drill will only look > at the files within that subdirectory. Directory pruning only comes into > the picture when there is a WHERE condition on dir0, dir1 etc. > > > On Tue, Feb 9, 2016 at 8:25 AM, Neeraja Rentachintala < > [email protected]> wrote: > > > Yes, Drill-3759 covers it. > > This is a high priority enhancement that we are trying to get to in the > > next couple of releases. > > > > -Neeraja > > > > On Tue, Feb 9, 2016 at 7:32 AM, John Omernik <[email protected]> wrote: > > > > > This one seems to cover it: > > > > > > https://issues.apache.org/jira/browse/DRILL-3759 > > > > > > > > > > > > On Tue, Feb 9, 2016 at 9:25 AM, Abdel Hakim Deneche < > > [email protected] > > > > > > > wrote: > > > > > > > Hi John, > > > > > > > > Sorry I didn't get back to you (I thought I did). > > > > > > > > No, I don't need the plan, I just wanted to confirm what was taking > > most > > > of > > > > the time and you already confirmed it's the planning. > > > > > > > > Can you open a JIRA for this ? this may be a known issue, but I'm not > > > sure. > > > > > > > > Thanks > > > > > > > > On Tue, Feb 9, 2016 at 6:08 AM, John Omernik <[email protected]> > wrote: > > > > > > > > > Abdel, do you still need the plans, as I said, if your table has > any > > > > decent > > > > > amount of directories and files, it looks like the planning is > > touching > > > > all > > > > > the directories even though you are pruning. I can post plans, > > > however, > > > > I > > > > > think in this case you'll find they are exactly the same, and the > > only > > > > > difference is that the longer queries is planning much more because > > it > > > > has > > > > > more files to read. > > > > > > > > > > > > > > > On Thu, Feb 4, 2016 at 10:46 AM, John Omernik <[email protected]> > > > wrote: > > > > > > > > > > > I can package up both plans for you if you need them (let me know > > if > > > > you > > > > > > still want them) but I can tell you the plans were EXACTLY the > > same, > > > > > > however the data-sum table took 0.932 seconds to plan the query, > > and > > > > the > > > > > > data table (the one with the all the extra data) took 11.379 > > seconds > > > to > > > > > > plan the query. Indicating to me the issue isn't in the plan that > > was > > > > > > created, but the actual planning process. (Let me know if you > > > disagree > > > > or > > > > > > still need to see the plan, like I said, the actual plans were > > > exactly > > > > > the > > > > > > same) > > > > > > > > > > > > > > > > > > John. > > > > > > > > > > > > > > > > > > On Thu, Feb 4, 2016 at 10:31 AM, Abdel Hakim Deneche < > > > > > > [email protected]> wrote: > > > > > > > > > > > >> Hey John, can you try an explain plan for both queries and see > how > > > > much > > > > > >> times it takes ? > > > > > >> > > > > > >> for example, for the first query you would run: > > > > > >> > > > > > >> *explain plan for* select count(1) from `data/2016-02-03`; > > > > > >> > > > > > >> It can also be helpful if you could share the query profiles for > > > both > > > > > >> queries. > > > > > >> > > > > > >> Thanks > > > > > >> > > > > > >> On Thu, Feb 4, 2016 at 8:15 AM, John Omernik <[email protected]> > > > > wrote: > > > > > >> > > > > > >> > Hey all, I think am I seeing an issue related to > > > > > >> > https://issues.apache.org/jira/browse/DRILL-3759 but I want > to > > > > > >> describe it > > > > > >> > out here, see if it's really the case, and then determine what > > the > > > > > >> blockers > > > > > >> > may be to resolution. > > > > > >> > > > > > > >> > I am using the MapR Developer Release 1.4, and I have a > > directory > > > > with > > > > > >> > subdirectories by data. > > > > > >> > > > > > > >> > data/2015-01-01 > > > > > >> > data/2015-01-02 > > > > > >> > data/2015-01-03 > > > > > >> > > > > > > >> > These are stored as Parquet files. At this point Each data > > > averages > > > > > >> about > > > > > >> > 1 GB of data, and has roughly 75 parquet files in it. > > > > > >> > > > > > > >> > When I run > > > > > >> > > > > > > >> > select count(1) from `data/2016-02-03` it takes roughly 11 > > > seconds. > > > > > >> > > > > > > >> > If I copy the 2016-02-03 directory to a new base (date-sum) > and > > > run > > > > > >> > > > > > > >> > select count(1) from `data_sum/2016-02-03` it runs in 0.874 > > > seconds. > > > > > >> > > > > > > >> > Same data, same structure, only difference is the data_sum > > > directory > > > > > >> only > > > > > >> > has a few directories, iand data has dates going back to Nov > > 2015. > > > > It > > > > > >> > seems like it is getting files name for all files in each > > > directory > > > > > >> prior > > > > > >> > to pruning which seems to me to be adding a lot of latency to > > > > queries > > > > > >> that > > > > > >> > doesn't need to be there. (thus I think I am seeing 3759) > but I > > > > > wanted > > > > > >> to > > > > > >> > confirm, and then I wanted to see how we can address this in > > that > > > > the > > > > > >> > directory prune should be fast, and on large data sets its > just > > > > going > > > > > to > > > > > >> > get worse and worse. > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > John > > > > > >> > > > > > > >> > > > > > >> > > > > > >> > > > > > >> -- > > > > > >> > > > > > >> Abdelhakim Deneche > > > > > >> > > > > > >> Software Engineer > > > > > >> > > > > > >> <http://www.mapr.com/> > > > > > >> > > > > > >> > > > > > >> Now Available - Free Hadoop On-Demand Training > > > > > >> < > > > > > >> > > > > > > > > > > > > > > > http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available > > > > > >> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > Abdelhakim Deneche > > > > > > > > Software Engineer > > > > > > > > <http://www.mapr.com/> > > > > > > > > > > > > Now Available - Free Hadoop On-Demand Training > > > > < > > > > > > > > > > http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available > > > > > > > > > > > > > > >
