Re: Impala slow at order by
> SQL 1= "select A from table where G="value"" > SQL 2= "select A from table where G="value" order by Z" Note that SQL 2 does not only add an ORDER BY, but also another column to read, Z, so 3 columns have to be read instead of 2. This leads to extra work for both Impala and Kudu (it is a column oriented storage engine, so has to read more data if there are more columns). A fairer way to test the effect of order by would be to always add Z to the query, e.g. select A, Z from table where G="value" select A, Z from table where G="value" order by Z On Thu, Oct 31, 2019 at 3:51 AM Tim Armstrong wrote: > > 1) Why is Impala slow with order by? > Impala's sort implementation is generally very fast, but the bottleneck > for a query can be in many different places. > > Like Shant said, the query profile will contain all the info needed to > determine where time was spent in the query. We don't have enough > information now to suggest why your query is performing the way it is. I'm > surprised that a sort operation on a single row would make that much > difference in query runtime. > > > 2) Can order by be made faster in clustered mode, that mean made to > be parallelized ? > Yes, sorts are parallelised across the cluster, *except* for the final > merge stage where the rows are collected and streamed back to the client. > > If you want to play around with increasing the parallelism of the sort, > there is a query option mt_dop that switches to a different multithreading > mode where you can increase the number of threads executing the plan, > including sort. E.g. if you set mt_dop=4 then on each node you will get 4 > threads doing the sort (assuming there are enough input files that the work > can be partitioned). > > > 3) Is it a good idea to use order by with Impala? if so have any body > use it with a larger data set with good performance. > If you're running a SELECT with an ORDER BY that returns a lot of rows, > the bottleneck is most likely the output - Impala can probably sort the > rows much faster than the client can fetch them. > > This actually makes it kinda hard to benchmark sort performance. We have > some targeted benchmarks that use analytic functions to test sort > performance to avoid the problem (it works because the query plans for > analytic functions involve a sort). See > https://github.com/apache/impala/blob/master/testdata/workloads/targeted-perf/queries/primitive_orderby_bigint.test > > > 4) Is there any other solutions to do fast order by queries within few > seconds. (Interactive query engines) > This is exactly what Impala is good at. > > On Wed, Oct 30, 2019 at 7:07 PM Shant Hovsepian > wrote: > >> Hi Dinushka! >> >> That's awesome you're working with Apache Impala for your internship! You >> should know the Impala website provides some incredible documentation. In >> particular check out the performance section to learn about how to >> benchmark queries. >> https://impala.apache.org/docs/build/html/topics/impala_performance.html >> >> As you mentioned since it's in "standalone mode" all sorts of things can >> be going, however the easiest way to get some insights to do look at the >> explain plan and profile for the queries you run. For examples take a look >> at this page >> https://impala.apache.org/docs/build/html/topics/impala_explain_plan.html >> >> Any chance you can share the EXPLAIN and PROFILE statement output of your >> queries? You should also take a look at the output of the SUMMARY statement >> for yourself in case anything obvious stands out to you. >> >> Thanks! >> >> On Wed, Oct 30, 2019 at 9:10 PM Dinushka wrote: >> >>> Hi, >>> >>> I'm a student doing an internship, I have been given a task to do DB >>> performance testing for kudu with Impala for our data and use case. >>> >>> Sample dataset is about 150 million records with 150 columns and total >>> size of kudu is 55GB. composite primary key (X,Y,Z) and partitioning by >>> hash (X =4,Y=2,Z=2) >>> >>> SQL 1= "select A from table where G="value"" >>> SQL 2= "select A from table where G="value" order by Z" >>> >>> I'm testing kudu and Impala in standalone mode and have 2 queries which >>> will only return one row. One with "order by" and other without "order >>> by". >>> >>> When I do testing, I found that Impala with order by is about 15% to >>> 35% slow. when you have order by in the SQL. >>> >>> In large row counts queries, it's time can be about 2-20 times more. >>> >>> 1) Why is Impala slow with order by? >>> >>> 2) Can order by be made faster in clustered mode, that mean made to be >>> parallelized >>> ? >>> >>> 3) Is it a good idea to use order by with Impala? if so have any body >>> use it with a larger data set with good performance. >>> >>> 4) Is there any other solutions to do fast order by queries within few >>> seconds. (Interactive query engines) >>> >>> >>> Thank you >>> >>
Re: Impala slow at order by
A large chunk of the query time is spent in codegen (i.e. compiling the query to machine code). This would explain the difference you're seeing. CodeGen:(Total: 95.989ms, non-child: 95.989ms, % non-child: 100.00%) You should compute statistics on the tables you're querying for optimal performance. In this case I'm pretty sure the query will get faster because we'll disable codegen based on the query row count. You need to run: compute stats T_table; In general, Impala is optimised more for querying larger data sets rather than minimising latency for < 500ms queries - once you get down to a couple of hundred milliseconds startup time of the query plays a pretty big role. So you might need to run queries on bigger data sets to get more representative numbers. On Fri, Nov 1, 2019 at 4:31 AM Boris Tyukin wrote: > Did you try to remove limit clause? It seems from a plan that limit is > applied during exchange > > On Fri, Nov 1, 2019, 00:00 Dinushka wrote: > >> Hi, >> >> Thank you very much for the advice. I have attached the profile logs. >> Each table is have 160 columns. each will only return one row and only >> difference between the queries is order by. Here the difference in time is >> 2 times the query without order by. >> >> >> Thanks >> >> On Thursday, October 31, 2019, 08:21:08 AM GMT+5:30, Tim Armstrong < >> tarmstr...@cloudera.com> wrote: >> >> >> > 1) Why is Impala slow with order by? >> Impala's sort implementation is generally very fast, but the bottleneck >> for a query can be in many different places. >> >> Like Shant said, the query profile will contain all the info needed to >> determine where time was spent in the query. We don't have enough >> information now to suggest why your query is performing the way it is. I'm >> surprised that a sort operation on a single row would make that much >> difference in query runtime. >> >> > 2) Can order by be made faster in clustered mode, that mean made to >> be parallelized ? >> Yes, sorts are parallelised across the cluster, *except* for the final >> merge stage where the rows are collected and streamed back to the client. >> >> If you want to play around with increasing the parallelism of the sort, >> there is a query option mt_dop that switches to a different multithreading >> mode where you can increase the number of threads executing the plan, >> including sort. E.g. if you set mt_dop=4 then on each node you will get 4 >> threads doing the sort (assuming there are enough input files that the work >> can be partitioned). >> >> > 3) Is it a good idea to use order by with Impala? if so have any body >> use it with a larger data set with good performance. >> If you're running a SELECT with an ORDER BY that returns a lot of rows, >> the bottleneck is most likely the output - Impala can probably sort the >> rows much faster than the client can fetch them. >> >> This actually makes it kinda hard to benchmark sort performance. We have >> some targeted benchmarks that use analytic functions to test sort >> performance to avoid the problem (it works because the query plans for >> analytic functions involve a sort). See >> https://github.com/apache/impala/blob/master/testdata/workloads/targeted-perf/queries/primitive_orderby_bigint.test >> >> > 4) Is there any other solutions to do fast order by queries within few >> seconds. (Interactive query engines) >> This is exactly what Impala is good at. >> >> On Wed, Oct 30, 2019 at 7:07 PM Shant Hovsepian >> wrote: >> >> Hi Dinushka! >> >> That's awesome you're working with Apache Impala for your internship! You >> should know the Impala website provides some incredible documentation. In >> particular check out the performance section to learn about how to >> benchmark queries. >> https://impala.apache.org/docs/build/html/topics/impala_performance.html >> >> As you mentioned since it's in "standalone mode" all sorts of things can >> be going, however the easiest way to get some insights to do look at the >> explain plan and profile for the queries you run. For examples take a look >> at this page >> https://impala.apache.org/docs/build/html/topics/impala_explain_plan.html >> >> Any chance you can share the EXPLAIN and PROFILE statement output of your >> queries? You should also take a look at the output of the SUMMARY statement >> for yourself in case anything obvious stands out to you. >> >> Thanks! >> >> On Wed, Oct 30, 2019 at 9:10 PM Dinushka wrote: >> >> Hi, >> >> I'm a student doing an internship, I have been given a task to do DB >> performance testing for kudu with Impala for our data and use case. >> >> Sample dataset is about 150 million records with 150 columns and total >> size of kudu is 55GB. composite primary key (X,Y,Z) and partitioning by >> hash (X =4,Y=2,Z=2) >> >> SQL 1= "select A from table where G="value"" >> SQL 2= "select A from table where G="value" order by Z" >> >> I'm testing kudu and Impala in standalone mode and have 2 queries which >> will only return one
Re: Impala slow at order by
Did you try to remove limit clause? It seems from a plan that limit is applied during exchange On Fri, Nov 1, 2019, 00:00 Dinushka wrote: > Hi, > > Thank you very much for the advice. I have attached the profile logs. Each > table is have 160 columns. each will only return one row and only > difference between the queries is order by. Here the difference in time is > 2 times the query without order by. > > > Thanks > > On Thursday, October 31, 2019, 08:21:08 AM GMT+5:30, Tim Armstrong < > tarmstr...@cloudera.com> wrote: > > > > 1) Why is Impala slow with order by? > Impala's sort implementation is generally very fast, but the bottleneck > for a query can be in many different places. > > Like Shant said, the query profile will contain all the info needed to > determine where time was spent in the query. We don't have enough > information now to suggest why your query is performing the way it is. I'm > surprised that a sort operation on a single row would make that much > difference in query runtime. > > > 2) Can order by be made faster in clustered mode, that mean made to > be parallelized ? > Yes, sorts are parallelised across the cluster, *except* for the final > merge stage where the rows are collected and streamed back to the client. > > If you want to play around with increasing the parallelism of the sort, > there is a query option mt_dop that switches to a different multithreading > mode where you can increase the number of threads executing the plan, > including sort. E.g. if you set mt_dop=4 then on each node you will get 4 > threads doing the sort (assuming there are enough input files that the work > can be partitioned). > > > 3) Is it a good idea to use order by with Impala? if so have any body > use it with a larger data set with good performance. > If you're running a SELECT with an ORDER BY that returns a lot of rows, > the bottleneck is most likely the output - Impala can probably sort the > rows much faster than the client can fetch them. > > This actually makes it kinda hard to benchmark sort performance. We have > some targeted benchmarks that use analytic functions to test sort > performance to avoid the problem (it works because the query plans for > analytic functions involve a sort). See > https://github.com/apache/impala/blob/master/testdata/workloads/targeted-perf/queries/primitive_orderby_bigint.test > > > 4) Is there any other solutions to do fast order by queries within few > seconds. (Interactive query engines) > This is exactly what Impala is good at. > > On Wed, Oct 30, 2019 at 7:07 PM Shant Hovsepian > wrote: > > Hi Dinushka! > > That's awesome you're working with Apache Impala for your internship! You > should know the Impala website provides some incredible documentation. In > particular check out the performance section to learn about how to > benchmark queries. > https://impala.apache.org/docs/build/html/topics/impala_performance.html > > As you mentioned since it's in "standalone mode" all sorts of things can > be going, however the easiest way to get some insights to do look at the > explain plan and profile for the queries you run. For examples take a look > at this page > https://impala.apache.org/docs/build/html/topics/impala_explain_plan.html > > Any chance you can share the EXPLAIN and PROFILE statement output of your > queries? You should also take a look at the output of the SUMMARY statement > for yourself in case anything obvious stands out to you. > > Thanks! > > On Wed, Oct 30, 2019 at 9:10 PM Dinushka wrote: > > Hi, > > I'm a student doing an internship, I have been given a task to do DB > performance testing for kudu with Impala for our data and use case. > > Sample dataset is about 150 million records with 150 columns and total > size of kudu is 55GB. composite primary key (X,Y,Z) and partitioning by > hash (X =4,Y=2,Z=2) > > SQL 1= "select A from table where G="value"" > SQL 2= "select A from table where G="value" order by Z" > > I'm testing kudu and Impala in standalone mode and have 2 queries which > will only return one row. One with "order by" and other without "order by" > . > > When I do testing, I found that Impala with order by is about 15% to 35% > slow. when you have order by in the SQL. > > In large row counts queries, it's time can be about 2-20 times more. > > 1) Why is Impala slow with order by? > > 2) Can order by be made faster in clustered mode, that mean made to be > parallelized > ? > > 3) Is it a good idea to use order by with Impala? if so have any body use > it with a larger data set with good performance. > > 4) Is there any other solutions to do fast order by queries within few > seconds. (Interactive query engines) > > > Thank you > >
Re: Impala slow at order by
Hi, Thank you very much for the advice. I have attached the profile logs. Each table is have 160 columns. each will only return one row and only difference between the queries is order by. Here the difference in time is 2 times the query without order by. Thanks On Thursday, October 31, 2019, 08:21:08 AM GMT+5:30, Tim Armstrong wrote: > 1) Why is Impala slow with order by?Impala's sort implementation is > generally very fast, but the bottleneck for a query can be in many different > places. Like Shant said, the query profile will contain all the info needed to determine where time was spent in the query. We don't have enough information now to suggest why your query is performing the way it is. I'm surprised that a sort operation on a single row would make that much difference in query runtime. > 2) Can order by be made faster in clustered mode, that mean made to be > parallelized ? Yes, sorts are parallelised across the cluster, *except* for > the final merge stage where the rows are collected and streamed back to the > client. If you want to play around with increasing the parallelism of the sort, there is a query option mt_dop that switches to a different multithreading mode where you can increase the number of threads executing the plan, including sort. E.g. if you set mt_dop=4 then on each node you will get 4 threads doing the sort (assuming there are enough input files that the work can be partitioned). > 3) Is it a good idea to use order by with Impala? if so have any body use it > with a larger data set with good performance.If you're running a SELECT with > an ORDER BY that returns a lot of rows, the bottleneck is most likely the > output - Impala can probably sort the rows much faster than the client can > fetch them. This actually makes it kinda hard to benchmark sort performance. We have some targeted benchmarks that use analytic functions to test sort performance to avoid the problem (it works because the query plans for analytic functions involve a sort). See https://github.com/apache/impala/blob/master/testdata/workloads/targeted-perf/queries/primitive_orderby_bigint.test > 4) Is there any other solutions to do fast order by queries within few >seconds. (Interactive query engines)This is exactly what Impala is good at. On Wed, Oct 30, 2019 at 7:07 PM Shant Hovsepian wrote: Hi Dinushka! That's awesome you're working with Apache Impala for your internship! You should know the Impala website provides some incredible documentation. In particular check out the performance section to learn about how to benchmark queries. https://impala.apache.org/docs/build/html/topics/impala_performance.html As you mentioned since it's in "standalone mode" all sorts of things can be going, however the easiest way to get some insights to do look at the explain plan and profile for the queries you run. For examples take a look at this page https://impala.apache.org/docs/build/html/topics/impala_explain_plan.html Any chance you can share the EXPLAIN and PROFILE statement output of your queries? You should also take a look at the output of the SUMMARY statement for yourself in case anything obvious stands out to you. Thanks! On Wed, Oct 30, 2019 at 9:10 PM Dinushka wrote: Hi, I'm a student doing an internship, I have been given a task to do DB performance testing for kudu with Impala for our data and use case. Sample dataset is about 150 million records with 150 columns and total size of kudu is 55GB. composite primary key (X,Y,Z) and partitioning by hash (X =4,Y=2,Z=2) SQL 1= "select A from table where G="value""SQL 2= "select A from table where G="value" order by Z" I'm testing kudu and Impala in standalone mode and have 2 queries which will only return one row. One with "order by" and other without "order by". When I do testing, I found that Impala with order by is about 15% to 35% slow. when you have order by in the SQL. In large row counts queries, it's time can be about 2-20 times more. 1) Why is Impala slow with order by? 2) Can order by be made faster in clustered mode, that mean made to be parallelized ? 3) Is it a good idea to use order by with Impala? if so have any body use it with a larger data set with good performance. 4) Is there any other solutions to do fast order by queries within few seconds. (Interactive query engines) Thank you Query (id=394a24077d5c40a3:4cf8c0e2): Summary: Session ID: 914d568889734906:57644ff50565a6a9 Session Type: BEESWAX Start Time: 2019-11-01 02:45:54.653821000 End Time: 2019-11-01 02:45:54.759479000 Query Type: QUERY Query State: FINISHED Query Status: OK Impala Version: impalad version 3.2.0-cdh6.3.0 RELEASE (build 495397101e5807c701df71ea288f4815d69c2c8a) User: user-s Connected User: user-s Delegated User: Network Address: :::10.10.20.10:57757 Default Db: default Sql Statement: SELECT A_ID, B_ID from T_table where
Re: Impala slow at order by
> 1) Why is Impala slow with order by? Impala's sort implementation is generally very fast, but the bottleneck for a query can be in many different places. Like Shant said, the query profile will contain all the info needed to determine where time was spent in the query. We don't have enough information now to suggest why your query is performing the way it is. I'm surprised that a sort operation on a single row would make that much difference in query runtime. > 2) Can order by be made faster in clustered mode, that mean made to be parallelized ? Yes, sorts are parallelised across the cluster, *except* for the final merge stage where the rows are collected and streamed back to the client. If you want to play around with increasing the parallelism of the sort, there is a query option mt_dop that switches to a different multithreading mode where you can increase the number of threads executing the plan, including sort. E.g. if you set mt_dop=4 then on each node you will get 4 threads doing the sort (assuming there are enough input files that the work can be partitioned). > 3) Is it a good idea to use order by with Impala? if so have any body use it with a larger data set with good performance. If you're running a SELECT with an ORDER BY that returns a lot of rows, the bottleneck is most likely the output - Impala can probably sort the rows much faster than the client can fetch them. This actually makes it kinda hard to benchmark sort performance. We have some targeted benchmarks that use analytic functions to test sort performance to avoid the problem (it works because the query plans for analytic functions involve a sort). See https://github.com/apache/impala/blob/master/testdata/workloads/targeted-perf/queries/primitive_orderby_bigint.test > 4) Is there any other solutions to do fast order by queries within few seconds. (Interactive query engines) This is exactly what Impala is good at. On Wed, Oct 30, 2019 at 7:07 PM Shant Hovsepian wrote: > Hi Dinushka! > > That's awesome you're working with Apache Impala for your internship! You > should know the Impala website provides some incredible documentation. In > particular check out the performance section to learn about how to > benchmark queries. > https://impala.apache.org/docs/build/html/topics/impala_performance.html > > As you mentioned since it's in "standalone mode" all sorts of things can > be going, however the easiest way to get some insights to do look at the > explain plan and profile for the queries you run. For examples take a look > at this page > https://impala.apache.org/docs/build/html/topics/impala_explain_plan.html > > Any chance you can share the EXPLAIN and PROFILE statement output of your > queries? You should also take a look at the output of the SUMMARY statement > for yourself in case anything obvious stands out to you. > > Thanks! > > On Wed, Oct 30, 2019 at 9:10 PM Dinushka wrote: > >> Hi, >> >> I'm a student doing an internship, I have been given a task to do DB >> performance testing for kudu with Impala for our data and use case. >> >> Sample dataset is about 150 million records with 150 columns and total >> size of kudu is 55GB. composite primary key (X,Y,Z) and partitioning by >> hash (X =4,Y=2,Z=2) >> >> SQL 1= "select A from table where G="value"" >> SQL 2= "select A from table where G="value" order by Z" >> >> I'm testing kudu and Impala in standalone mode and have 2 queries which >> will only return one row. One with "order by" and other without "order by >> ". >> >> When I do testing, I found that Impala with order by is about 15% to 35% >> slow. when you have order by in the SQL. >> >> In large row counts queries, it's time can be about 2-20 times more. >> >> 1) Why is Impala slow with order by? >> >> 2) Can order by be made faster in clustered mode, that mean made to be >> parallelized >> ? >> >> 3) Is it a good idea to use order by with Impala? if so have any body use >> it with a larger data set with good performance. >> >> 4) Is there any other solutions to do fast order by queries within few >> seconds. (Interactive query engines) >> >> >> Thank you >> >
Re: Impala slow at order by
Hi Dinushka! That's awesome you're working with Apache Impala for your internship! You should know the Impala website provides some incredible documentation. In particular check out the performance section to learn about how to benchmark queries. https://impala.apache.org/docs/build/html/topics/impala_performance.html As you mentioned since it's in "standalone mode" all sorts of things can be going, however the easiest way to get some insights to do look at the explain plan and profile for the queries you run. For examples take a look at this page https://impala.apache.org/docs/build/html/topics/impala_explain_plan.html Any chance you can share the EXPLAIN and PROFILE statement output of your queries? You should also take a look at the output of the SUMMARY statement for yourself in case anything obvious stands out to you. Thanks! On Wed, Oct 30, 2019 at 9:10 PM Dinushka wrote: > Hi, > > I'm a student doing an internship, I have been given a task to do DB > performance testing for kudu with Impala for our data and use case. > > Sample dataset is about 150 million records with 150 columns and total > size of kudu is 55GB. composite primary key (X,Y,Z) and partitioning by > hash (X =4,Y=2,Z=2) > > SQL 1= "select A from table where G="value"" > SQL 2= "select A from table where G="value" order by Z" > > I'm testing kudu and Impala in standalone mode and have 2 queries which > will only return one row. One with "order by" and other without "order by" > . > > When I do testing, I found that Impala with order by is about 15% to 35% > slow. when you have order by in the SQL. > > In large row counts queries, it's time can be about 2-20 times more. > > 1) Why is Impala slow with order by? > > 2) Can order by be made faster in clustered mode, that mean made to be > parallelized > ? > > 3) Is it a good idea to use order by with Impala? if so have any body use > it with a larger data set with good performance. > > 4) Is there any other solutions to do fast order by queries within few > seconds. (Interactive query engines) > > > Thank you >