Found the reason in the query profiles. It is the exchange again. The noshuffle hint helped a lot. When you do create table parq as select * from kudu180M, Impala scans Kudu and writes directly to HDFS. When you do insert into parq partition (year) select * from kudu180M where partition=2018, it reads only 45M rows, but the exchange hashes the rows across the nodes before they are written, so it is slower.
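
For reference, a minimal sketch of the two insert variants, assuming parq is a Parquet table partitioned by year and that the filter column in kudu180M is also called year (the column name is my assumption, and select * assumes the partition column comes last):

```sql
-- Default plan: Impala adds an exchange that hash-partitions the rows
-- on the partition key before the HDFS table sink.
INSERT INTO parq PARTITION (year)
SELECT * FROM kudu180M WHERE year = 2018;

-- With the hint: no exchange before the sink, so each node writes the
-- rows it scanned itself.
INSERT INTO parq PARTITION (year) /* +NOSHUFFLE */
SELECT * FROM kudu180M WHERE year = 2018;
```

Running EXPLAIN on both statements should show whether the exchange node is present. The hint is probably a good fit here only because a single partition is written; with many partitions per node it can produce more files and use more memory.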

On 2018/07/31 20:59:28, Mike Percy <[email protected]> wrote:
> Can you post a query profile from Impala for one of the slow insert jobs?
>
> Mike
>
> On Tue, Jul 31, 2018 at 12:56 PM Tomas Farkas <[email protected]> wrote:
>
> > Hi,
> > I wanted to share with you the preliminary results of my Kudu testing on AWS.
> > I created a set of performance tests to evaluate different instance
> > types in AWS, different configurations (Kudu separated from Impala, Kudu
> > and Impala on the same nodes), and different drive (st1 and gp2) settings.
> > Here are my results:
> >
> > I was quite disappointed by the inserts in Step3, see the attached sqls.
> >
> > Any hints or ideas why this does not scale?
> > Thanks
