Re: Impala slow at order by

Boris Tyukin Fri, 01 Nov 2019 04:31:55 -0700

Did you try to remove limit clause? It seems from a plan that limit is
applied during exchange


On Fri, Nov 1, 2019, 00:00 Dinushka <[email protected]> wrote:

> Hi,
>
> Thank you very much for the advice. I have attached the profile logs. Each
> table is have 160 columns. each will only return one row and only
> difference between the queries is order by. Here the difference in time is
> 2 times the query without order by.
>
>
> Thanks
>
> On Thursday, October 31, 2019, 08:21:08 AM GMT+5:30, Tim Armstrong <
> [email protected]> wrote:
>
>
> > 1) Why is Impala slow with order by?
> Impala's sort implementation is generally very fast, but the bottleneck
> for a query can be in many different places.
>
> Like Shant said, the query profile will contain all the info needed to
> determine where time was spent in the query. We don't have enough
> information now to suggest why your query is performing the way it is. I'm
> surprised that a sort operation on a single row would make that much
> difference in query runtime.
>
> > 2) Can order by  be made faster in clustered mode, that mean made to
> be parallelized ?
> Yes, sorts are parallelised across the cluster, *except* for the final
> merge stage where the rows are collected and streamed back to the client.
>
> If you want to play around with increasing the parallelism of the sort,
> there is a query option mt_dop that switches to a different multithreading
> mode where you can increase the number of threads executing the plan,
> including sort. E.g. if you set mt_dop=4 then on each node you will get 4
> threads doing the sort (assuming there are enough input files that the work
> can be partitioned).
>
> > 3) Is it a good idea to use order by with Impala? if so have any body
> use it with a larger data set with good performance.
> If you're running a SELECT with an ORDER BY that returns a lot of rows,
> the bottleneck is most likely the output - Impala can probably sort the
> rows much faster than the client can fetch them.
>
> This actually makes it kinda hard to benchmark sort performance. We have
> some targeted benchmarks that use analytic functions to test sort
> performance to avoid the problem (it works because the query plans for
> analytic functions involve a sort). See
> https://github.com/apache/impala/blob/master/testdata/workloads/targeted-perf/queries/primitive_orderby_bigint.test
>
> > 4) Is there any other solutions to do fast order by queries within few
> seconds. (Interactive query engines)
> This is exactly what Impala is good at.
>
> On Wed, Oct 30, 2019 at 7:07 PM Shant Hovsepian <[email protected]>
> wrote:
>
> Hi Dinushka!
>
> That's awesome you're working with Apache Impala for your internship! You
> should know the Impala website provides some incredible documentation. In
> particular check out the performance section to learn about how to
> benchmark queries.
> https://impala.apache.org/docs/build/html/topics/impala_performance.html
>
> As you mentioned since it's in "standalone mode" all sorts of things can
> be going, however the easiest way to get some insights to do look at the
> explain plan and profile for the queries you run. For examples take a look
> at this page
> https://impala.apache.org/docs/build/html/topics/impala_explain_plan.html
>
> Any chance you can share the EXPLAIN and PROFILE statement output of your
> queries? You should also take a look at the output of the SUMMARY statement
> for yourself in case anything obvious stands out to you.
>
> Thanks!
>
> On Wed, Oct 30, 2019 at 9:10 PM Dinushka <[email protected]> wrote:
>
> Hi,
>
> I'm a student doing an internship, I have been given a task to do DB
> performance testing for kudu with Impala for our data and use case.
>
> Sample dataset is about 150 million records with 150 columns  and total
> size of kudu is 55GB. composite primary key (X,Y,Z) and partitioning by
> hash (X =4,Y=2,Z=2)
>
> SQL 1= "select  A from table where  G="value""
> SQL 2= "select  A from table where  G="value" order by Z"
>
> I'm testing kudu and Impala in standalone mode and have 2 queries which
> will only return one row. One with "order by" and other without "order by"
> .
>
> When I do testing, I found that Impala with order by is about 15% to 35%
> slow. when you have order by in the SQL.
>
> In large row counts queries, it's time can be about 2-20 times more.
>
> 1) Why is Impala slow with order by?
>
> 2) Can order by  be made faster in clustered mode, that mean made to be 
> parallelized
> ?
>
> 3) Is it a good idea to use order by with Impala? if so have any body use
> it with a larger data set with good performance.
>
> 4) Is there any other solutions to do fast order by queries within few
> seconds. (Interactive query engines)
>
>
> Thank you
>
>

Re: Impala slow at order by

Reply via email to