Re: poor performance on insert into range partitions and scaling

2018-08-02 Thread farkas
Found the reason in the profiles. It is again the exchange; the noshuffle hint
helped a lot. When you do create table parq as select * from kudu180M, Impala
scans Kudu and writes directly to HDFS. When you do insert into parq partition
(year) select * from kudu180M where partition=2018, it still only reads 45M
rows, but the exchange hash-partitions them before the write, so it is slower.
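
A minimal sketch of the three variants, assuming the table and column names
used above (parq, kudu180M, year) and reading "where partition=2018" as a
filter on the year partition column, rather than the exact test schema:

  -- CTAS: scans Kudu and streams straight to files on HDFS; no exchange.
  -- (stored as parquet is an assumption; Impala's CTAS default is text.)
  create table parq stored as parquet as
  select * from kudu180M;

  -- Plain partitioned insert: Impala adds an exchange that hash-partitions
  -- the rows before the write, which is what dominated the profile. The
  -- partition column (year) must come last in the select list.
  insert into parq partition (year)
  select * from kudu180M where year = 2018;

  -- The same insert with the noshuffle hint: the exchange is skipped and
  -- each node writes the rows it scans.
  insert into parq partition (year) /* +noshuffle */
  select * from kudu180M where year = 2018;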

On 2018/07/31 20:59:28, Mike Percy  wrote: 
> Can you post a query profile from Impala for one of the slow insert jobs?
> 
> Mike


Re: poor performance on insert into range partitions and scaling

2018-08-01 Thread farkas



On 2018/07/31 20:59:28, Mike Percy  wrote: 
> Can you post a query profile from Impala for one of the slow insert jobs?


Impala profile:

Query (id=7e441f58868a8585:74520210):
  Summary:
Session ID: be47b44e2dde497b:fd4a9bd251e2c897
Session Type: BEESWAX
Start Time: 2018-08-01 15:40:30.976897000
End Time: 2018-08-01 15:43:17.236925000
Query Type: DML
Query State: FINISHED
Query Status: OK
Impala Version: impalad version 2.12.0-cdh5.15.0 RELEASE (build 23f574543323301846b41fa5433690df32efe085)
User: impala@KUDUTEST.DOMAIN.LOCAL
Connected User: impala@KUDUTEST.DOMAIN.LOCAL
Delegated User: 
Network Address: 10.197.7.176:45846
Default Db: default
Sql Statement: insert into test.test_kudu_range 
select 
  id,
  name,
  ban,
  2018,
  iban,
  eventdate,
  eventid,
  city, 
  state,
  lat,
  lon
from test.test_prq
Coordinator: ip-10-197-10-88.eu-west-1.compute.internal:22000
Query Options (set by configuration): 
Query Options (set by configuration and planner): MT_DOP=0
Plan: 

Max Per-Host Resource Reservation: Memory=4.00MB
Per-Host Resource Estimates: Memory=832.00MB

F01:PLAN FRAGMENT [KUDU(KuduPartition(id, name, ban, 2018))] hosts=6 instances=6
|  Per-Host Resources: mem-estimate=128.00MB mem-reservation=4.00MB
INSERT INTO KUDU [test.test_kudu_range]
|  mem-estimate=0B mem-reservation=0B
|
02:PARTIAL SORT
|  order by: KuduPartition(id, name, ban, 2018) ASC NULLS LAST, id ASC NULLS LAST, name ASC NULLS LAST, ban ASC NULLS LAST
|  materialized: KuduPartition(id, name, ban, 2018)
|  mem-estimate=128.00MB mem-reservation=4.00MB spill-buffer=2.00MB
|  tuple-ids=2 row-size=248B cardinality=45000000
|
01:EXCHANGE [KUDU(KuduPartition(id, name, ban, 2018))]
|  mem-estimate=0B mem-reservation=0B
|  tuple-ids=0 row-size=244B cardinality=45000000
|
F00:PLAN FRAGMENT [RANDOM] hosts=6 instances=6
Per-Host Resources: mem-estimate=704.00MB mem-reservation=0B
00:SCAN HDFS [test.test_prq, RANDOM]
   partitions=1/1 files=24 size=4.93GB
   stored statistics:
 table: rows=45000000 size=4.93GB
 columns: all
   extrapolated-rows=disabled
   mem-estimate=704.00MB mem-reservation=0B
   tuple-ids=0 row-size=244B cardinality=45000000

Estimated Per-Host Mem: 872415232
Per Host Min Reservation: 
ip-10-197-10-88.eu-west-1.compute.internal:22000(4.00 MB) 
ip-10-197-11-142.eu-west-1.compute.internal:22000(4.00 MB) 
ip-10-197-29-94.eu-west-1.compute.internal:22000(4.00 MB) 
ip-10-197-3-207.eu-west-1.compute.internal:22000(4.00 MB) 
ip-10-197-30-21.eu-west-1.compute.internal:22000(4.00 MB) 
ip-10-197-7-97.eu-west-1.compute.internal:22000(4.00 MB) 
Request Pool: root.impala
Admission result: Admitted immediately
ExecSummary: 
Operator          #Hosts   Avg Time   Max Time   #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail
----------------------------------------------------------------------------------------------------
02:PARTIAL SORT        6   12s495ms   26s075ms  45.00M      45.00M    3.49 GB      128.00 MB
01:EXCHANGE            6  232.833ms  517.001ms  45.00M      45.00M   13.72 MB              0  KUDU(KuduPartition(id, name, ban, 2018))
00:SCAN HDFS           6   96.500ms  117.000ms  45.00M      45.00M  774.11 MB      704.00 MB  test.test_prq
Errors: Key already present in Kudu table 'impala::test.test_kudu_range'. (1 of 15442 similar)

Query Compilation: 4.022ms
   - Metadata of all 2 tables cached: 380.433us (380.433us)
   - Analysis finished: 1.141ms (761.253us)
   - Value transfer graph computed: 1.254ms (113.129us)
   - Single node plan created: 1.780ms (525.298us)
   - Runtime filters computed: 1.819ms (39.444us)
   - Distributed plan created: 2.508ms (688.869us)
   - Lineage info computed: 2.678ms (169.662us)
   - Planning finished: 4.022ms (1.343ms)
Query Timeline: 2m46s
   - Query submitted: 0.000ns (0.000ns)
   - Planning finished: 5.000ms (5.000ms)
   - Submit for admission: 6.000ms (1.000ms)
   - Completed admission: 6.000ms (0.000ns)
   - Ready to start on 6 backends: 6.000ms (0.000ns)
   - All 6 execution backends (12 fragment instances) started: 7.000ms (1.000ms)
   - Released admission control resources
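
In the ExecSummary above, the scan averages under 100ms per host while the
02:PARTIAL SORT operator accounts for nearly all of the runtime, and the
"Key already present" errors mean rows with duplicate primary keys were
discarded. Two possible follow-ups, sketched against the statement from this
profile; the noshuffle hint and UPSERT are standard Impala features for Kudu
tables, but whether they help this particular workload is an assumption, not
something verified in this thread:

  -- Hint Impala to skip the KuduPartition exchange so each node sorts and
  -- writes only the rows it scans.
  insert into test.test_kudu_range /* +noshuffle */
  select id, name, ban, 2018, iban, eventdate, eventid, city, state, lat, lon
  from test.test_prq;

  -- On a Kudu table, UPSERT overwrites rows whose primary key already
  -- exists instead of raising "Key already present" errors.
  upsert into test.test_kudu_range
  select id, name, ban, 2018, iban, eventdate, eventid, city, state, lat, lon
  from test.test_prq;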

Re: poor performance on insert into range partitions and scaling

2018-07-31 Thread Mike Percy
Can you post a query profile from Impala for one of the slow insert jobs?

Mike

On Tue, Jul 31, 2018 at 12:56 PM Tomas Farkas  wrote:

> Hi,
> I wanted to share with you the preliminary results of my Kudu testing on AWS.
> I created a set of performance tests to evaluate different instance types in
> AWS and different configurations (Kudu separated from Impala, Kudu and Impala
> on the same nodes), as well as different drive settings (st1 and gp2). Here
> are my results:
>
> I was quite disappointed by the inserts in Step 3; see the attached SQLs.
>
> Any hints or ideas why this does not scale?
> Thanks