This one seems to cover it: https://issues.apache.org/jira/browse/DRILL-3759

On Tue, Feb 9, 2016 at 9:25 AM, Abdel Hakim Deneche <[email protected]> wrote:

> Hi John,
>
> Sorry I didn't get back to you (I thought I did).
>
> No, I don't need the plan, I just wanted to confirm what was taking most of
> the time and you already confirmed it's the planning.
>
> Can you open a JIRA for this? This may be a known issue, but I'm not sure.
>
> Thanks
>
> On Tue, Feb 9, 2016 at 6:08 AM, John Omernik <[email protected]> wrote:
>
> > Abdel, do you still need the plans? As I said, if your table has any
> > decent amount of directories and files, it looks like the planning is
> > touching all the directories even though you are pruning. I can post
> > plans; however, I think in this case you'll find they are exactly the
> > same, and the only difference is that the longer query is planning much
> > more because it has more files to read.
> >
> > On Thu, Feb 4, 2016 at 10:46 AM, John Omernik <[email protected]> wrote:
> >
> > > I can package up both plans for you if you need them (let me know if
> > > you still want them) but I can tell you the plans were EXACTLY the
> > > same. However, the data_sum table took 0.932 seconds to plan the query,
> > > and the data table (the one with all the extra data) took 11.379
> > > seconds to plan the query, which indicates to me that the issue isn't
> > > in the plan that was created, but in the actual planning process. (Let
> > > me know if you disagree or still need to see the plan; like I said, the
> > > actual plans were exactly the same.)
> > >
> > > John.
> > >
> > > On Thu, Feb 4, 2016 at 10:31 AM, Abdel Hakim Deneche <
> > > [email protected]> wrote:
> > >
> > >> Hey John, can you try an explain plan for both queries and see how
> > >> much time it takes?
> > >>
> > >> For example, for the first query you would run:
> > >>
> > >> explain plan for select count(1) from `data/2016-02-03`;
> > >>
> > >> It can also be helpful if you could share the query profiles for both
> > >> queries.
> > >>
> > >> Thanks
> > >>
> > >> On Thu, Feb 4, 2016 at 8:15 AM, John Omernik <[email protected]> wrote:
> > >>
> > >> > Hey all, I think I am seeing an issue related to
> > >> > https://issues.apache.org/jira/browse/DRILL-3759 but I want to
> > >> > describe it out here, see if it's really the case, and then
> > >> > determine what the blockers may be to resolution.
> > >> >
> > >> > I am using the MapR Developer Release 1.4, and I have a directory
> > >> > with subdirectories by date.
> > >> >
> > >> > data/2015-01-01
> > >> > data/2015-01-02
> > >> > data/2015-01-03
> > >> >
> > >> > These are stored as Parquet files. At this point each date averages
> > >> > about 1 GB of data and has roughly 75 Parquet files in it.
> > >> >
> > >> > When I run
> > >> >
> > >> > select count(1) from `data/2016-02-03` it takes roughly 11 seconds.
> > >> >
> > >> > If I copy the 2016-02-03 directory to a new base (data_sum) and run
> > >> >
> > >> > select count(1) from `data_sum/2016-02-03` it runs in 0.874 seconds.
> > >> >
> > >> > Same data, same structure; the only difference is that the data_sum
> > >> > directory only has a few directories, and data has dates going back
> > >> > to Nov 2015. It seems like it is getting file names for all files in
> > >> > each directory prior to pruning, which seems to me to be adding a
> > >> > lot of latency to queries that doesn't need to be there (thus I
> > >> > think I am seeing 3759). I wanted to confirm, and then I wanted to
> > >> > see how we can address this, in that the directory prune should be
> > >> > fast, and on large data sets it's just going to get worse and worse.
> > >> >
> > >> > John
> > >>
> > >> --
> > >> Abdelhakim Deneche
> > >> Software Engineer
> > >> <http://www.mapr.com/>
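
To make the suspected cost pattern in the thread concrete, here is a minimal Python sketch. It is not Drill code and makes no claim about how Drill's planner is implemented; the function names (`list_then_prune`, `prune_then_list`, `build_table`, `demo`) and the 30-directory layout are hypothetical, chosen only to mirror the `base/<date>/<file>.parquet` structure described above. It compares listing every date directory before filtering against resolving the requested directory first.

```python
import os
import tempfile


def list_then_prune(base, wanted):
    """List every date directory, then keep only files under `wanted`."""
    listings = 0
    files = []
    for sub in sorted(os.listdir(base)):
        listings += 1  # one listing per date directory, wanted or not
        for name in sorted(os.listdir(os.path.join(base, sub))):
            files.append((sub, name))
    kept = [name for sub, name in files if sub == wanted]
    return kept, listings


def prune_then_list(base, wanted):
    """Resolve the wanted directory first and list only that one."""
    kept = sorted(os.listdir(os.path.join(base, wanted)))
    return kept, 1


def build_table(base, dates, files_per_date):
    """Create empty placeholder files laid out as base/<date>/<n>.parquet."""
    for date in dates:
        os.makedirs(os.path.join(base, date))
        for i in range(files_per_date):
            open(os.path.join(base, date, f"{i}.parquet"), "w").close()


def demo():
    """Run both strategies against a throwaway 30-day table."""
    with tempfile.TemporaryDirectory() as base:
        dates = [f"2016-01-{d:02d}" for d in range(1, 31)]
        build_table(base, dates, files_per_date=5)
        slow = list_then_prune(base, "2016-01-15")
        fast = prune_then_list(base, "2016-01-15")
        return slow, fast


if __name__ == "__main__":
    (slow_files, slow_listings), (fast_files, fast_listings) = demo()
    print("same files selected:", slow_files == fast_files)
    print("directory listings:", slow_listings, "vs", fast_listings)
```

Both strategies select the same files, but the first does work proportional to the number of date directories under the base, which matches the observation that the identical query plans more slowly against the larger `data` base than against `data_sum`.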
