Re: Query Planning and Directory Pruning

Aman Sinha Tue, 09 Feb 2016 09:02:39 -0800

At a glance, John's query does not have a WHERE clause. It is querying the
subdirectory directly in the FROM clause, so in this case Drill will only look
at the files within that subdirectory.  Directory pruning only comes into
the picture when there is a WHERE condition on dir0, dir1, etc.
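
To make the distinction concrete, here is a rough sketch using the paths from
John's mail. The second query is an assumption about how the same count would
be written against the parent directory. It is not something John ran.

```sql
-- Subdirectory named directly in FROM: there is no WHERE clause,
-- so directory pruning is not involved.
select count(1) from `data/2016-02-03`;

-- Parent directory in FROM with a filter on dir0 (the first-level
-- subdirectory name): this is the case where directory pruning applies.
select count(1) from `data` where dir0 = '2016-02-03';
```

Both forms should count the same rows, assuming all the files for that date
sit directly under that one subdirectory.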



On Tue, Feb 9, 2016 at 8:25 AM, Neeraja Rentachintala <
[email protected]> wrote:

> Yes, DRILL-3759 covers it.
> This is a high priority enhancement that we are trying to get to in the
> next couple of releases.
>
> -Neeraja
>
> On Tue, Feb 9, 2016 at 7:32 AM, John Omernik <[email protected]> wrote:
>
> > This one seems to cover it:
> >
> > https://issues.apache.org/jira/browse/DRILL-3759
> >
> >
> >
> > On Tue, Feb 9, 2016 at 9:25 AM, Abdel Hakim Deneche <
> [email protected]
> > >
> > wrote:
> >
> > > Hi John,
> > >
> > > Sorry I didn't get back to you (I thought I did).
> > >
> > > No, I don't need the plan, I just wanted to confirm what was taking
> most
> > of
> > > the time and you already confirmed it's the planning.
> > >
> > > > Can you open a JIRA for this? This may be a known issue, but I'm not
> > sure.
> > >
> > > Thanks
> > >
> > > On Tue, Feb 9, 2016 at 6:08 AM, John Omernik <[email protected]> wrote:
> > >
> > > > > Abdel, do you still need the plans? As I said, if your table has any
> > > decent
> > > > amount of directories and files, it looks like the planning is
> touching
> > > all
> > > > the directories even though you are pruning.  I can post plans,
> > however,
> > > I
> > > > think in this case you'll find they are exactly the same, and the
> only
> > > > > difference is that the longer query is planning much more because
> it
> > > has
> > > > more files to read.
> > > >
> > > >
> > > > On Thu, Feb 4, 2016 at 10:46 AM, John Omernik <[email protected]>
> > wrote:
> > > >
> > > > > I can package up both plans for you if you need them (let me know
> if
> > > you
> > > > > still want them) but I can tell you the plans were EXACTLY the
> same,
> > > > > > however the data_sum table took 0.932 seconds to plan the query,
> and
> > > the
> > > > > > data table (the one with all the extra data) took 11.379
> seconds
> > to
> > > > > > plan the query, indicating to me the issue isn't in the plan that
> was
> > > > > created, but the actual planning process. (Let me know if you
> > disagree
> > > or
> > > > > still need to see the plan, like I said, the actual plans were
> > exactly
> > > > the
> > > > > same)
> > > > >
> > > > >
> > > > > John.
> > > > >
> > > > >
> > > > > On Thu, Feb 4, 2016 at 10:31 AM, Abdel Hakim Deneche <
> > > > > [email protected]> wrote:
> > > > >
> > > > >> Hey John, can you try an explain plan for both queries and see how
> > > much
> > > > > >> time it takes?
> > > > >>
> > > > >> for example, for the first query you would run:
> > > > >>
> > > > >> *explain plan for* select count(1) from `data/2016-02-03`;
> > > > >>
> > > > >> It can also be helpful if you could share the query profiles for
> > both
> > > > >> queries.
> > > > >>
> > > > >> Thanks
> > > > >>
> > > > >> On Thu, Feb 4, 2016 at 8:15 AM, John Omernik <[email protected]>
> > > wrote:
> > > > >>
> > > > > >> > Hey all, I think I am seeing an issue related to
> > > > >> > https://issues.apache.org/jira/browse/DRILL-3759 but I want to
> > > > >> describe it
> > > > >> > out here, see if it's really the case, and then determine what
> the
> > > > >> blockers
> > > > >> > may be to resolution.
> > > > >> >
> > > > >> > I am using the MapR Developer Release 1.4, and I have a
> directory
> > > with
> > > > > >> > subdirectories by date.
> > > > >> >
> > > > >> > data/2015-01-01
> > > > >> > data/2015-01-02
> > > > >> > data/2015-01-03
> > > > >> >
> > > > > >> > These are stored as Parquet files.  At this point each date
> > averages
> > > > >> about
> > > > >> > 1 GB of data, and has roughly 75 parquet files in it.
> > > > >> >
> > > > >> > When I run
> > > > >> >
> > > > >> > select count(1) from `data/2016-02-03` it takes roughly 11
> > seconds.
> > > > >> >
> > > > > >> > If I copy the 2016-02-03 directory to a new base (data_sum) and
> > run
> > > > >> >
> > > > >> > select count(1) from `data_sum/2016-02-03` it runs in 0.874
> > seconds.
> > > > >> >
> > > > >> > Same data, same structure, only difference is the data_sum
> > directory
> > > > >> only
> > > > > >> > has a few directories, and data has dates going back to Nov
> 2015.
> > > It
> > > > > >> > seems like it is getting file names for all files in each
> > directory
> > > > >> prior
> > > > >> > to pruning which seems to me to be adding a lot of latency to
> > > queries
> > > > >> that
> > > > >> > doesn't need to be there.  (thus I think I am seeing 3759) but I
> > > > wanted
> > > > >> to
> > > > >> > confirm, and then I wanted to see how we can address this in
> that
> > > the
> > > > > >> > directory prune should be fast, and on large data sets it's just
> > > going
> > > > to
> > > > >> > get worse and worse.
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > John
> > > > >> >
> > > > >>
> > > > >>
> > > > >>
> > > > >> --
> > > > >>
> > > > >> Abdelhakim Deneche
> > > > >>
> > > > >> Software Engineer
> > > > >>
> > > > >>   <http://www.mapr.com/>
> > > > >>
> > > > >>
> > > > >> Now Available - Free Hadoop On-Demand Training
> > > > >> <
> > > > >>
> > > >
> > >
> >
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > >
> > > Abdelhakim Deneche
> > >
> > > Software Engineer
> > >
> > >   <http://www.mapr.com/>
> > >
> > >
> > > Now Available - Free Hadoop On-Demand Training
> > > <
> > >
> >
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> > > >
> > >
> >
>
