Re: Impala slow at order by

2019-11-04 Thread Csaba Ringhofer
> SQL 1= "select  A from table where  G="value""
> SQL 2= "select  A from table where  G="value" order by Z"
Note that SQL 2 does not only add an ORDER BY, but also another column to
read, Z, so 3 columns have to be read instead of 2.
This leads to extra work for both Impala and Kudu (it is a column oriented
storage engine, so has to read more data if there are more columns).

A fairer way to test the effect of order by would be to always add Z to the
query, e.g.
select  A, Z from table where  G="value"
select  A, Z from table where  G="value" order by Z


On Thu, Oct 31, 2019 at 3:51 AM Tim Armstrong 
wrote:

> > 1) Why is Impala slow with order by?
> Impala's sort implementation is generally very fast, but the bottleneck
> for a query can be in many different places.
>
> Like Shant said, the query profile will contain all the info needed to
> determine where time was spent in the query. We don't have enough
> information now to suggest why your query is performing the way it is. I'm
> surprised that a sort operation on a single row would make that much
> difference in query runtime.
>
> > 2) Can order by  be made faster in clustered mode, that mean made to
> be parallelized ?
> Yes, sorts are parallelised across the cluster, *except* for the final
> merge stage where the rows are collected and streamed back to the client.
>
> If you want to play around with increasing the parallelism of the sort,
> there is a query option mt_dop that switches to a different multithreading
> mode where you can increase the number of threads executing the plan,
> including sort. E.g. if you set mt_dop=4 then on each node you will get 4
> threads doing the sort (assuming there are enough input files that the work
> can be partitioned).
>
> > 3) Is it a good idea to use order by with Impala? if so have any body
> use it with a larger data set with good performance.
> If you're running a SELECT with an ORDER BY that returns a lot of rows,
> the bottleneck is most likely the output - Impala can probably sort the
> rows much faster than the client can fetch them.
>
> This actually makes it kinda hard to benchmark sort performance. We have
> some targeted benchmarks that use analytic functions to test sort
> performance to avoid the problem (it works because the query plans for
> analytic functions involve a sort). See
> https://github.com/apache/impala/blob/master/testdata/workloads/targeted-perf/queries/primitive_orderby_bigint.test
>
> > 4) Is there any other solutions to do fast order by queries within few
> seconds. (Interactive query engines)
> This is exactly what Impala is good at.
>
> On Wed, Oct 30, 2019 at 7:07 PM Shant Hovsepian 
> wrote:
>
>> Hi Dinushka!
>>
>> That's awesome you're working with Apache Impala for your internship! You
>> should know the Impala website provides some incredible documentation. In
>> particular check out the performance section to learn about how to
>> benchmark queries.
>> https://impala.apache.org/docs/build/html/topics/impala_performance.html
>>
>> As you mentioned since it's in "standalone mode" all sorts of things can
>> be going, however the easiest way to get some insights to do look at the
>> explain plan and profile for the queries you run. For examples take a look
>> at this page
>> https://impala.apache.org/docs/build/html/topics/impala_explain_plan.html
>>
>> Any chance you can share the EXPLAIN and PROFILE statement output of your
>> queries? You should also take a look at the output of the SUMMARY statement
>> for yourself in case anything obvious stands out to you.
>>
>> Thanks!
>>
>> On Wed, Oct 30, 2019 at 9:10 PM Dinushka  wrote:
>>
>>> Hi,
>>>
>>> I'm a student doing an internship, I have been given a task to do DB
>>> performance testing for kudu with Impala for our data and use case.
>>>
>>> Sample dataset is about 150 million records with 150 columns  and total
>>> size of kudu is 55GB. composite primary key (X,Y,Z) and partitioning by
>>> hash (X =4,Y=2,Z=2)
>>>
>>> SQL 1= "select  A from table where  G="value""
>>> SQL 2= "select  A from table where  G="value" order by Z"
>>>
>>> I'm testing kudu and Impala in standalone mode and have 2 queries which
>>> will only return one row. One with "order by" and other without "order
>>> by".
>>>
>>> When I do testing, I found that Impala with order by is about 15% to
>>> 35% slow. when you have order by in the SQL.
>>>
>>> In large row counts queries, it's time can be about 2-20 times more.
>>>
>>> 1) Why is Impala slow with order by?
>>>
>>> 2) Can order by  be made faster in clustered mode, that mean made to be 
>>> parallelized
>>> ?
>>>
>>> 3) Is it a good idea to use order by with Impala? if so have any body
>>> use it with a larger data set with good performance.
>>>
>>> 4) Is there any other solutions to do fast order by queries within few
>>> seconds. (Interactive query engines)
>>>
>>>
>>> Thank you
>>>
>>


Re: Impala slow at order by

2019-11-01 Thread Tim Armstrong
A large chunk of the query time is spent in codegen (i.e. compiling the
query to machine code). This would explain the difference you're seeing.

  CodeGen:(Total: 95.989ms, non-child: 95.989ms, % non-child: 100.00%)

You should compute statistics on the tables you're querying for
optimal performance. In this case I'm pretty sure the query will get
faster because we'll disable codegen based on the query row count. You
need to run:

  compute stats T_table;

In general, Impala is optimised more for querying larger data sets
rather than minimising latency for < 500ms queries - once you get down
to a couple of hundred milliseconds startup time of the query plays a
pretty big role.

So you might need to run queries on bigger data sets to get more
representative numbers.


On Fri, Nov 1, 2019 at 4:31 AM Boris Tyukin  wrote:

> Did you try to remove limit clause? It seems from a plan that limit is
> applied during exchange
>
> On Fri, Nov 1, 2019, 00:00 Dinushka  wrote:
>
>> Hi,
>>
>> Thank you very much for the advice. I have attached the profile logs.
>> Each table is have 160 columns. each will only return one row and only
>> difference between the queries is order by. Here the difference in time is
>> 2 times the query without order by.
>>
>>
>> Thanks
>>
>> On Thursday, October 31, 2019, 08:21:08 AM GMT+5:30, Tim Armstrong <
>> tarmstr...@cloudera.com> wrote:
>>
>>
>> > 1) Why is Impala slow with order by?
>> Impala's sort implementation is generally very fast, but the bottleneck
>> for a query can be in many different places.
>>
>> Like Shant said, the query profile will contain all the info needed to
>> determine where time was spent in the query. We don't have enough
>> information now to suggest why your query is performing the way it is. I'm
>> surprised that a sort operation on a single row would make that much
>> difference in query runtime.
>>
>> > 2) Can order by  be made faster in clustered mode, that mean made to
>> be parallelized ?
>> Yes, sorts are parallelised across the cluster, *except* for the final
>> merge stage where the rows are collected and streamed back to the client.
>>
>> If you want to play around with increasing the parallelism of the sort,
>> there is a query option mt_dop that switches to a different multithreading
>> mode where you can increase the number of threads executing the plan,
>> including sort. E.g. if you set mt_dop=4 then on each node you will get 4
>> threads doing the sort (assuming there are enough input files that the work
>> can be partitioned).
>>
>> > 3) Is it a good idea to use order by with Impala? if so have any body
>> use it with a larger data set with good performance.
>> If you're running a SELECT with an ORDER BY that returns a lot of rows,
>> the bottleneck is most likely the output - Impala can probably sort the
>> rows much faster than the client can fetch them.
>>
>> This actually makes it kinda hard to benchmark sort performance. We have
>> some targeted benchmarks that use analytic functions to test sort
>> performance to avoid the problem (it works because the query plans for
>> analytic functions involve a sort). See
>> https://github.com/apache/impala/blob/master/testdata/workloads/targeted-perf/queries/primitive_orderby_bigint.test
>>
>> > 4) Is there any other solutions to do fast order by queries within few
>> seconds. (Interactive query engines)
>> This is exactly what Impala is good at.
>>
>> On Wed, Oct 30, 2019 at 7:07 PM Shant Hovsepian 
>> wrote:
>>
>> Hi Dinushka!
>>
>> That's awesome you're working with Apache Impala for your internship! You
>> should know the Impala website provides some incredible documentation. In
>> particular check out the performance section to learn about how to
>> benchmark queries.
>> https://impala.apache.org/docs/build/html/topics/impala_performance.html
>>
>> As you mentioned since it's in "standalone mode" all sorts of things can
>> be going, however the easiest way to get some insights to do look at the
>> explain plan and profile for the queries you run. For examples take a look
>> at this page
>> https://impala.apache.org/docs/build/html/topics/impala_explain_plan.html
>>
>> Any chance you can share the EXPLAIN and PROFILE statement output of your
>> queries? You should also take a look at the output of the SUMMARY statement
>> for yourself in case anything obvious stands out to you.
>>
>> Thanks!
>>
>> On Wed, Oct 30, 2019 at 9:10 PM Dinushka  wrote:
>>
>> Hi,
>>
>> I'm a student doing an internship, I have been given a task to do DB
>> performance testing for kudu with Impala for our data and use case.
>>
>> Sample dataset is about 150 million records with 150 columns  and total
>> size of kudu is 55GB. composite primary key (X,Y,Z) and partitioning by
>> hash (X =4,Y=2,Z=2)
>>
>> SQL 1= "select  A from table where  G="value""
>> SQL 2= "select  A from table where  G="value" order by Z"
>>
>> I'm testing kudu and Impala in standalone mode and have 2 queries which
>> will only return one 

Re: Impala slow at order by

2019-11-01 Thread Boris Tyukin
Did you try to remove limit clause? It seems from a plan that limit is
applied during exchange

On Fri, Nov 1, 2019, 00:00 Dinushka  wrote:

> Hi,
>
> Thank you very much for the advice. I have attached the profile logs. Each
> table is have 160 columns. each will only return one row and only
> difference between the queries is order by. Here the difference in time is
> 2 times the query without order by.
>
>
> Thanks
>
> On Thursday, October 31, 2019, 08:21:08 AM GMT+5:30, Tim Armstrong <
> tarmstr...@cloudera.com> wrote:
>
>
> > 1) Why is Impala slow with order by?
> Impala's sort implementation is generally very fast, but the bottleneck
> for a query can be in many different places.
>
> Like Shant said, the query profile will contain all the info needed to
> determine where time was spent in the query. We don't have enough
> information now to suggest why your query is performing the way it is. I'm
> surprised that a sort operation on a single row would make that much
> difference in query runtime.
>
> > 2) Can order by  be made faster in clustered mode, that mean made to
> be parallelized ?
> Yes, sorts are parallelised across the cluster, *except* for the final
> merge stage where the rows are collected and streamed back to the client.
>
> If you want to play around with increasing the parallelism of the sort,
> there is a query option mt_dop that switches to a different multithreading
> mode where you can increase the number of threads executing the plan,
> including sort. E.g. if you set mt_dop=4 then on each node you will get 4
> threads doing the sort (assuming there are enough input files that the work
> can be partitioned).
>
> > 3) Is it a good idea to use order by with Impala? if so have any body
> use it with a larger data set with good performance.
> If you're running a SELECT with an ORDER BY that returns a lot of rows,
> the bottleneck is most likely the output - Impala can probably sort the
> rows much faster than the client can fetch them.
>
> This actually makes it kinda hard to benchmark sort performance. We have
> some targeted benchmarks that use analytic functions to test sort
> performance to avoid the problem (it works because the query plans for
> analytic functions involve a sort). See
> https://github.com/apache/impala/blob/master/testdata/workloads/targeted-perf/queries/primitive_orderby_bigint.test
>
> > 4) Is there any other solutions to do fast order by queries within few
> seconds. (Interactive query engines)
> This is exactly what Impala is good at.
>
> On Wed, Oct 30, 2019 at 7:07 PM Shant Hovsepian 
> wrote:
>
> Hi Dinushka!
>
> That's awesome you're working with Apache Impala for your internship! You
> should know the Impala website provides some incredible documentation. In
> particular check out the performance section to learn about how to
> benchmark queries.
> https://impala.apache.org/docs/build/html/topics/impala_performance.html
>
> As you mentioned since it's in "standalone mode" all sorts of things can
> be going, however the easiest way to get some insights to do look at the
> explain plan and profile for the queries you run. For examples take a look
> at this page
> https://impala.apache.org/docs/build/html/topics/impala_explain_plan.html
>
> Any chance you can share the EXPLAIN and PROFILE statement output of your
> queries? You should also take a look at the output of the SUMMARY statement
> for yourself in case anything obvious stands out to you.
>
> Thanks!
>
> On Wed, Oct 30, 2019 at 9:10 PM Dinushka  wrote:
>
> Hi,
>
> I'm a student doing an internship, I have been given a task to do DB
> performance testing for kudu with Impala for our data and use case.
>
> Sample dataset is about 150 million records with 150 columns  and total
> size of kudu is 55GB. composite primary key (X,Y,Z) and partitioning by
> hash (X =4,Y=2,Z=2)
>
> SQL 1= "select  A from table where  G="value""
> SQL 2= "select  A from table where  G="value" order by Z"
>
> I'm testing kudu and Impala in standalone mode and have 2 queries which
> will only return one row. One with "order by" and other without "order by"
> .
>
> When I do testing, I found that Impala with order by is about 15% to 35%
> slow. when you have order by in the SQL.
>
> In large row counts queries, it's time can be about 2-20 times more.
>
> 1) Why is Impala slow with order by?
>
> 2) Can order by  be made faster in clustered mode, that mean made to be 
> parallelized
> ?
>
> 3) Is it a good idea to use order by with Impala? if so have any body use
> it with a larger data set with good performance.
>
> 4) Is there any other solutions to do fast order by queries within few
> seconds. (Interactive query engines)
>
>
> Thank you
>
>


Re: Impala slow at order by

2019-10-31 Thread Dinushka
 Hi,

Thank you very much for the advice. I have attached the profile logs. Each 
table is have 160 columns. each will only return one row and only difference 
between the queries is order by. Here the difference in time is 2 times the 
query without order by. 

Thanks    
On Thursday, October 31, 2019, 08:21:08 AM GMT+5:30, Tim Armstrong 
 wrote:  
 
 > 1) Why is Impala slow with order by?Impala's sort implementation is 
 > generally very fast, but the bottleneck for a query can be in many different 
 > places.
Like Shant said, the query profile will contain all the info needed to 
determine where time was spent in the query. We don't have enough information 
now to suggest why your query is performing the way it is. I'm surprised that a 
sort operation on a single row would make that much difference in query runtime.
> 2) Can order by  be made faster in clustered mode, that mean made to be 
> parallelized ? Yes, sorts are parallelised across the cluster, *except* for 
> the final merge stage where the rows are collected and streamed back to the 
> client.
If you want to play around with increasing the parallelism of the sort, there 
is a query option mt_dop that switches to a different multithreading mode where 
you can increase the number of threads executing the plan, including sort. E.g. 
if you set mt_dop=4 then on each node you will get 4 threads doing the sort 
(assuming there are enough input files that the work can be partitioned).
> 3) Is it a good idea to use order by with Impala? if so have any body use it 
> with a larger data set with good performance.If you're running a SELECT with 
> an ORDER BY that returns a lot of rows, the bottleneck is most likely the 
> output - Impala can probably sort the rows much faster than the client can 
> fetch them.
This actually makes it kinda hard to benchmark sort performance. We have some 
targeted benchmarks that use analytic functions to test sort performance to 
avoid the problem (it works because the query plans for analytic functions 
involve a sort). See 
https://github.com/apache/impala/blob/master/testdata/workloads/targeted-perf/queries/primitive_orderby_bigint.test
> 4) Is there any other solutions to do fast order by queries within few 
>seconds. (Interactive query engines)This is exactly what Impala is good at.
On Wed, Oct 30, 2019 at 7:07 PM Shant Hovsepian  wrote:

Hi Dinushka!
That's awesome you're working with Apache Impala for your internship! You 
should know the Impala website provides some incredible documentation. In 
particular check out the performance section to learn about how to benchmark 
queries. 
https://impala.apache.org/docs/build/html/topics/impala_performance.html
As you mentioned since it's in "standalone mode" all sorts of things can be 
going, however the easiest way to get some insights to do look at the explain 
plan and profile for the queries you run. For examples take a look at this page 
https://impala.apache.org/docs/build/html/topics/impala_explain_plan.html
Any chance you can share the EXPLAIN and PROFILE statement output of your 
queries? You should also take a look at the output of the SUMMARY statement for 
yourself in case anything obvious stands out to you.
Thanks!
On Wed, Oct 30, 2019 at 9:10 PM Dinushka  wrote:

Hi,
I'm a student doing an internship, I have been given a task to do DB 
performance testing for kudu with Impala for our data and use case.
Sample dataset is about 150 million records with 150 columns  and total size of 
kudu is 55GB. composite primary key (X,Y,Z) and partitioning by hash (X 
=4,Y=2,Z=2)
SQL 1= "select  A from table where  G="value""SQL 2= "select  A from table 
where  G="value" order by Z"

I'm testing kudu and Impala in standalone mode and have 2 queries which will 
only return one row. One with "order by" and other without "order by".
When I do testing, I found that Impala with order by is about 15% to 35% slow. 
when you have order by in the SQL.
In large row counts queries, it's time can be about 2-20 times more.
1) Why is Impala slow with order by?
2) Can order by  be made faster in clustered mode, that mean made to be 
parallelized ? 
3) Is it a good idea to use order by with Impala? if so have any body use it 
with a larger data set with good performance.
4) Is there any other solutions to do fast order by queries within few seconds. 
(Interactive query engines) 
Thank you

  Query (id=394a24077d5c40a3:4cf8c0e2):
  Summary:
Session ID: 914d568889734906:57644ff50565a6a9
Session Type: BEESWAX
Start Time: 2019-11-01 02:45:54.653821000
End Time: 2019-11-01 02:45:54.759479000
Query Type: QUERY
Query State: FINISHED
Query Status: OK
Impala Version: impalad version 3.2.0-cdh6.3.0 RELEASE (build 
495397101e5807c701df71ea288f4815d69c2c8a)
User: user-s
Connected User: user-s
Delegated User: 
Network Address: :::10.10.20.10:57757
Default Db: default
Sql Statement: SELECT A_ID, B_ID from T_table where 

Re: Impala slow at order by

2019-10-30 Thread Tim Armstrong
> 1) Why is Impala slow with order by?
Impala's sort implementation is generally very fast, but the bottleneck for
a query can be in many different places.

Like Shant said, the query profile will contain all the info needed to
determine where time was spent in the query. We don't have enough
information now to suggest why your query is performing the way it is. I'm
surprised that a sort operation on a single row would make that much
difference in query runtime.

> 2) Can order by  be made faster in clustered mode, that mean made to
be parallelized ?
Yes, sorts are parallelised across the cluster, *except* for the final
merge stage where the rows are collected and streamed back to the client.

If you want to play around with increasing the parallelism of the sort,
there is a query option mt_dop that switches to a different multithreading
mode where you can increase the number of threads executing the plan,
including sort. E.g. if you set mt_dop=4 then on each node you will get 4
threads doing the sort (assuming there are enough input files that the work
can be partitioned).

> 3) Is it a good idea to use order by with Impala? if so have any body use
it with a larger data set with good performance.
If you're running a SELECT with an ORDER BY that returns a lot of rows, the
bottleneck is most likely the output - Impala can probably sort the rows
much faster than the client can fetch them.

This actually makes it kinda hard to benchmark sort performance. We have
some targeted benchmarks that use analytic functions to test sort
performance to avoid the problem (it works because the query plans for
analytic functions involve a sort). See
https://github.com/apache/impala/blob/master/testdata/workloads/targeted-perf/queries/primitive_orderby_bigint.test

> 4) Is there any other solutions to do fast order by queries within few
seconds. (Interactive query engines)
This is exactly what Impala is good at.

On Wed, Oct 30, 2019 at 7:07 PM Shant Hovsepian 
wrote:

> Hi Dinushka!
>
> That's awesome you're working with Apache Impala for your internship! You
> should know the Impala website provides some incredible documentation. In
> particular check out the performance section to learn about how to
> benchmark queries.
> https://impala.apache.org/docs/build/html/topics/impala_performance.html
>
> As you mentioned since it's in "standalone mode" all sorts of things can
> be going, however the easiest way to get some insights to do look at the
> explain plan and profile for the queries you run. For examples take a look
> at this page
> https://impala.apache.org/docs/build/html/topics/impala_explain_plan.html
>
> Any chance you can share the EXPLAIN and PROFILE statement output of your
> queries? You should also take a look at the output of the SUMMARY statement
> for yourself in case anything obvious stands out to you.
>
> Thanks!
>
> On Wed, Oct 30, 2019 at 9:10 PM Dinushka  wrote:
>
>> Hi,
>>
>> I'm a student doing an internship, I have been given a task to do DB
>> performance testing for kudu with Impala for our data and use case.
>>
>> Sample dataset is about 150 million records with 150 columns  and total
>> size of kudu is 55GB. composite primary key (X,Y,Z) and partitioning by
>> hash (X =4,Y=2,Z=2)
>>
>> SQL 1= "select  A from table where  G="value""
>> SQL 2= "select  A from table where  G="value" order by Z"
>>
>> I'm testing kudu and Impala in standalone mode and have 2 queries which
>> will only return one row. One with "order by" and other without "order by
>> ".
>>
>> When I do testing, I found that Impala with order by is about 15% to 35%
>> slow. when you have order by in the SQL.
>>
>> In large row counts queries, it's time can be about 2-20 times more.
>>
>> 1) Why is Impala slow with order by?
>>
>> 2) Can order by  be made faster in clustered mode, that mean made to be 
>> parallelized
>> ?
>>
>> 3) Is it a good idea to use order by with Impala? if so have any body use
>> it with a larger data set with good performance.
>>
>> 4) Is there any other solutions to do fast order by queries within few
>> seconds. (Interactive query engines)
>>
>>
>> Thank you
>>
>


Re: Impala slow at order by

2019-10-30 Thread Shant Hovsepian
Hi Dinushka!

That's awesome you're working with Apache Impala for your internship! You
should know the Impala website provides some incredible documentation. In
particular check out the performance section to learn about how to
benchmark queries.
https://impala.apache.org/docs/build/html/topics/impala_performance.html

As you mentioned since it's in "standalone mode" all sorts of things can be
going, however the easiest way to get some insights to do look at the
explain plan and profile for the queries you run. For examples take a look
at this page
https://impala.apache.org/docs/build/html/topics/impala_explain_plan.html

Any chance you can share the EXPLAIN and PROFILE statement output of your
queries? You should also take a look at the output of the SUMMARY statement
for yourself in case anything obvious stands out to you.

Thanks!

On Wed, Oct 30, 2019 at 9:10 PM Dinushka  wrote:

> Hi,
>
> I'm a student doing an internship, I have been given a task to do DB
> performance testing for kudu with Impala for our data and use case.
>
> Sample dataset is about 150 million records with 150 columns  and total
> size of kudu is 55GB. composite primary key (X,Y,Z) and partitioning by
> hash (X =4,Y=2,Z=2)
>
> SQL 1= "select  A from table where  G="value""
> SQL 2= "select  A from table where  G="value" order by Z"
>
> I'm testing kudu and Impala in standalone mode and have 2 queries which
> will only return one row. One with "order by" and other without "order by"
> .
>
> When I do testing, I found that Impala with order by is about 15% to 35%
> slow. when you have order by in the SQL.
>
> In large row counts queries, it's time can be about 2-20 times more.
>
> 1) Why is Impala slow with order by?
>
> 2) Can order by  be made faster in clustered mode, that mean made to be 
> parallelized
> ?
>
> 3) Is it a good idea to use order by with Impala? if so have any body use
> it with a larger data set with good performance.
>
> 4) Is there any other solutions to do fast order by queries within few
> seconds. (Interactive query engines)
>
>
> Thank you
>