Did you try to remove limit clause? It seems from a plan that limit is applied during exchange
On Fri, Nov 1, 2019, 00:00 Dinushka <dinushk...@yahoo.com> wrote: > Hi, > > Thank you very much for the advice. I have attached the profile logs. Each > table is have 160 columns. each will only return one row and only > difference between the queries is order by. Here the difference in time is > 2 times the query without order by. > > > Thanks > > On Thursday, October 31, 2019, 08:21:08 AM GMT+5:30, Tim Armstrong < > tarmstr...@cloudera.com> wrote: > > > > 1) Why is Impala slow with order by? > Impala's sort implementation is generally very fast, but the bottleneck > for a query can be in many different places. > > Like Shant said, the query profile will contain all the info needed to > determine where time was spent in the query. We don't have enough > information now to suggest why your query is performing the way it is. I'm > surprised that a sort operation on a single row would make that much > difference in query runtime. > > > 2) Can order by be made faster in clustered mode, that mean made to > be parallelized ? > Yes, sorts are parallelised across the cluster, *except* for the final > merge stage where the rows are collected and streamed back to the client. > > If you want to play around with increasing the parallelism of the sort, > there is a query option mt_dop that switches to a different multithreading > mode where you can increase the number of threads executing the plan, > including sort. E.g. if you set mt_dop=4 then on each node you will get 4 > threads doing the sort (assuming there are enough input files that the work > can be partitioned). > > > 3) Is it a good idea to use order by with Impala? if so have any body > use it with a larger data set with good performance. > If you're running a SELECT with an ORDER BY that returns a lot of rows, > the bottleneck is most likely the output - Impala can probably sort the > rows much faster than the client can fetch them. > > This actually makes it kinda hard to benchmark sort performance. We have > some targeted benchmarks that use analytic functions to test sort > performance to avoid the problem (it works because the query plans for > analytic functions involve a sort). See > https://github.com/apache/impala/blob/master/testdata/workloads/targeted-perf/queries/primitive_orderby_bigint.test > > > 4) Is there any other solutions to do fast order by queries within few > seconds. (Interactive query engines) > This is exactly what Impala is good at. > > On Wed, Oct 30, 2019 at 7:07 PM Shant Hovsepian <sh...@arcadiadata.com> > wrote: > > Hi Dinushka! > > That's awesome you're working with Apache Impala for your internship! You > should know the Impala website provides some incredible documentation. In > particular check out the performance section to learn about how to > benchmark queries. > https://impala.apache.org/docs/build/html/topics/impala_performance.html > > As you mentioned since it's in "standalone mode" all sorts of things can > be going, however the easiest way to get some insights to do look at the > explain plan and profile for the queries you run. For examples take a look > at this page > https://impala.apache.org/docs/build/html/topics/impala_explain_plan.html > > Any chance you can share the EXPLAIN and PROFILE statement output of your > queries? You should also take a look at the output of the SUMMARY statement > for yourself in case anything obvious stands out to you. > > Thanks! > > On Wed, Oct 30, 2019 at 9:10 PM Dinushka <dinushk...@yahoo.com> wrote: > > Hi, > > I'm a student doing an internship, I have been given a task to do DB > performance testing for kudu with Impala for our data and use case. > > Sample dataset is about 150 million records with 150 columns and total > size of kudu is 55GB. composite primary key (X,Y,Z) and partitioning by > hash (X =4,Y=2,Z=2) > > SQL 1= "select A from table where G="value"" > SQL 2= "select A from table where G="value" order by Z" > > I'm testing kudu and Impala in standalone mode and have 2 queries which > will only return one row. One with "order by" and other without "order by" > . > > When I do testing, I found that Impala with order by is about 15% to 35% > slow. when you have order by in the SQL. > > In large row counts queries, it's time can be about 2-20 times more. > > 1) Why is Impala slow with order by? > > 2) Can order by be made faster in clustered mode, that mean made to be > parallelized > ? > > 3) Is it a good idea to use order by with Impala? if so have any body use > it with a larger data set with good performance. > > 4) Is there any other solutions to do fast order by queries within few > seconds. (Interactive query engines) > > > Thank you > >