The read should be parallelized. See FooterGatherer. What makes you think it isn't parallelized?
We've seen this set of operations be expensive in some situations and quite bad in the case of hundreds of thousands of files. We're working on improving this with this JIRA: https://issues.apache.org/jira/browse/DRILL-2743

Note, I also think Steven has identified some places where we re-get FileStatus multiple times, which can also lead to poorer startup performance. I'm not sure there is an issue open against this, but we should get one opened and resolved.

On Wed, May 6, 2015 at 11:13 PM, Adam Gilmore <[email protected]> wrote:

> Just a follow up - I have isolated that it is almost linear according to
> the number of Parquet files. The footer read is quite expensive and not
> parallelised at all (it uses it for query planning).
>
> Is there any way to control the row group size when creating Parquet
> files? I could create fewer, larger files, but still want the benefit of
> smaller row groups (as I have just done the Parquet pushdown filtering).
>
> On Thu, May 7, 2015 at 4:08 PM, Adam Gilmore <[email protected]>
> wrote:
>
> > Hi guys,
> >
> > I've been looking at the speed of some of our queries and have noticed
> > there is quite a significant delay to the query actually starting.
> >
> > For example, querying about 70 Parquet files in a directory, it takes
> > about 370ms before it starts the first fragment.
> >
> > Obviously, considering it's not in the plan, it's very hard to see where
> > exactly it's spending that 370ms without instrumenting/debugging.
> >
> > How can I troubleshoot where Drill is spending this 370ms?
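
[Editor's note] On the row-group-size question quoted above: when Drill itself writes the Parquet files (e.g. via CTAS), the target row group size can be set with the `store.parquet.block-size` session option, whose value is in bytes (the default is 536870912, i.e. 512 MB). A minimal sketch; the table and path names here are hypothetical, for illustration only:

```sql
-- Write smaller row groups (64 MB here instead of the 512 MB default),
-- so pushdown filtering can skip more data within each larger file.
ALTER SESSION SET `store.parquet.block-size` = 67108864;

-- Hypothetical CTAS: rewrites the source data with the smaller row groups.
CREATE TABLE dfs.tmp.`events_small_rg` AS
SELECT * FROM dfs.`/data/events`;
```

This only affects files Drill writes; row group size in files produced by other writers (parquet-mr, Spark, etc.) is controlled by the corresponding setting in that writer.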
