Hi,
I'm a student doing an internship, and I have been given the task of DB 
performance testing for Kudu with Impala for our data and use case.
The sample dataset is about 150 million records with 150 columns, and the total 
size in Kudu is 55 GB. The table has a composite primary key (X, Y, Z) and is 
partitioned by hash, with 4 buckets on X, 2 on Y, and 2 on Z.
SQL 1: select A from table where G = "value"
SQL 2: select A from table where G = "value" order by Z
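
To make the setup easier to reproduce, here is a rough sketch of the table 
layout described above. The table name my_table, the column types, and the 
reading of "X=4, Y=2, Z=2" as hash bucket counts are assumptions for 
illustration only; the real table has about 150 columns, and the syntax 
assumes a recent Impala release with Kudu support.

```sql
-- Hypothetical DDL: the table name and all column types are placeholders.
CREATE TABLE my_table (
  x BIGINT,
  y BIGINT,
  z BIGINT,
  a STRING,
  g STRING,
  -- ... roughly 145 more columns omitted ...
  PRIMARY KEY (x, y, z)
)
PARTITION BY HASH (x) PARTITIONS 4,
             HASH (y) PARTITIONS 2,
             HASH (z) PARTITIONS 2
STORED AS KUDU;

-- The two test queries against this sketch.
-- G is not part of the primary key in this layout.
SELECT a FROM my_table WHERE g = 'value';
SELECT a FROM my_table WHERE g = 'value' ORDER BY z;
```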

I'm testing Kudu and Impala in standalone mode with these two queries, each of 
which returns only one row: one with "order by" and the other without.
In my tests, the query with "order by" is about 15% to 35% slower than the one 
without it.
For queries that return a large number of rows, it can take about 2 to 20 times 
longer.
1) Why is Impala slower with order by?
2) Can order by be made faster in clustered mode, i.e. can it be 
parallelized?
3) Is it a good idea to use order by with Impala? If so, has anybody used it 
with a larger dataset and gotten good performance?
4) Are there any other solutions (interactive query engines) that can run 
order by queries within a few seconds?
Thank you
