[jira] [Commented] (HAWQ-1183) Writable external table with Hash distribution shows slow performance
[ https://issues.apache.org/jira/browse/HAWQ-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15711343#comment-15711343 ] Paul Guo commented on HAWQ-1183: With the previous patch, the planner and run time is as expected now. postgres=# explain analyze INSERT INTO ext_tbl1 SELECT * from tbl1; QUERY PLAN - - Insert (cost=0.00..444.03 rows=167 width=8) Rows out: Avg 351833.3 rows x 6 workers. Max/Last(seg5:host67/seg0:host67) 351916/351849 rows with 89/171 ms to first row, 4074/4209 ms to end, start offset by 46/45 ms. Executor memory: 1K bytes avg, 1K bytes max (seg5:host67). -> Result (cost=0.00..431.01 rows=167 width=20) Rows out: Avg 351833.3 rows x 6 workers. Max/Last(seg5:host67/seg0:host67) 351916/351849 rows with 77/148 ms to first row, 292/392 ms to end, start offset by 46/45 ms. -> Table Scan on tbl1 (cost=0.00..431.00 rows=167 width=8) Rows out: Avg 351833.3 rows x 6 workers. Max/Last(seg5:host67/seg2:host67) 351916/351855 rows with 77/152 ms to first row, 158/257 ms to end, start offset by 46/42 ms. Slice statistics: (slice0) Executor memory: 280K bytes avg x 6 workers, 280K bytes max (seg5:host67). Statement statistics: Memory used: 262144K bytes Optimizer status: PQO version 1.684 Dispatcher statistics: executors used(total/cached/new connection): (6/0/6); dispatcher time(total/connection/dispatch data): (38.288 ms/37.708 ms/0.078 ms). dispatch data time(max/min/avg): (0.028 ms/0.004 ms/0.012 ms); consume executor data time(max/min/avg): (0.067 ms/0.014 ms/0.029 ms); free executor time(max/min/avg): (0.000 ms/0 .000 ms/0.000 ms). Data locality statistics: data locality ratio: 1.000; virtual segment number: 6; different host number: 1; virtual segment number per host(avg/min/max): (6/6/6); segment size(avg/min/max): (7670609.333 B/ 7668464 B/7672344 B); segment size with penalty(avg/min/max): (0.000 B/0 B/0 B); continuity(avg/min/max): (1.000/1.000/1.000); DFS metadatacache: 28.855 ms; resource allocation: 12. 933 ms; datalocality calculation: 0.190 ms. Total runtime: 4333.663 ms (18 rows) > Writable external table with Hash distribution shows slow performance > - > > Key: HAWQ-1183 > URL: https://issues.apache.org/jira/browse/HAWQ-1183 > Project: Apache HAWQ > Issue Type: Bug >Reporter: Paul Guo >Assignee: Paul Guo > > Steps: > 1. Create tables and populate them. > drop table tbl1; > drop external table ext_tbl1; > drop external table ext_tbl1_random; > CREATE TABLE tbl1 (a int, b text) DISTRIBUTED BY (a); > INSERT INTO tbl1 VALUES (generate_series(1,1000),'aaa'); > INSERT INTO tbl1 VALUES (generate_series(1,1),'bbb'); > INSERT INTO tbl1 VALUES (generate_series(1,10),'bbc'); > INSERT INTO tbl1 VALUES (generate_series(1,100),'bdbc'); > INSERT INTO tbl1 VALUES (generate_series(1,100),'bdddbc'); > CREATE WRITABLE EXTERNAL TABLE ext_tbl1 > ( LIKE tbl1 ) > LOCATION ('gpfdist://127.0.0.1/tbl1.csv') > FORMAT 'CSV' (DELIMITER ',') > DISTRIBUTED BY (a); > CREATE WRITABLE EXTERNAL TABLE ext_tbl1_random > ( LIKE tbl1 ) > LOCATION ('gpfdist://127.0.0.1/tbl1.random.csv') > FORMAT 'CSV' (DELIMITER ',') > DISTRIBUTED RANDOMLY; > 2. Write the two external tables. We can find that the external table with > hash distribution is slow with inserting, and plan shows that it has 1 > workers only. > postgres=# explain analyze INSERT INTO ext_tbl1 SELECT * from tbl1; > QUERY PLAN > - > - > > Insert (cost=0.00..509.20 rows=1000 width=8) >Rows out: Avg 2111000.0 rows x 1 workers. > Max/Last(seg0:host67/seg0:host67) 2111000/2111000 rows with 17/17 ms to first > row, 20145/20145 ms to end, start offset by 18/18 ms. >Executor memory: 1K bytes. >-> Result (cost=0.00..431.07 rows=1000 width=20) > Rows out: Avg 2111000.0 rows x 1 workers. > Max/Last(seg0:host67/seg0:host67) 2111000/2111000 rows with 14/14 ms to first > row, 1919/1919 ms to end, start offset by 18/18 ms > . > -> Redistribute
[jira] [Commented] (HAWQ-1183) Writable external table with Hash distribution shows slow performance
[ https://issues.apache.org/jira/browse/HAWQ-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15711332#comment-15711332 ] Paul Guo commented on HAWQ-1183: 16559 and 16561 are oid of the two external tables, one for hash and another for random. postgres=# select * from gp_distribution_policy; localoid | bucketnum | attrnums --+---+-- 16554 | 6 | {1} 16559 | 1 | {1} 16561 | 1 | (3 rows) Looking into DefineExternalRelation(), it looks like for EXTTBL_TYPE_LOCATION, it set the bucket number as the (for our case is gpfdist) location number createStmt->policy->bucketnum = locLength; I talked with related designer, this seems to be a hack. In theory we should save location number and bucket number in different places in catalog tables. In short term, we could fix this soon with the patch below, @@ -970,7 +970,7 @@ DefineExternalRelation(CreateExternalStmt *createExtStmt) isweb, iswritable,); if(!isCustom){ int locLength = list_length(exttypeDesc->location_list); - if (createStmt->policy && locLength > 0) + if (createStmt->policy && locLength > 0 && locLength > createStmt->policy->bucketnum) { createStmt->policy->bucketnum = locLength; } In the long run, we should save bucket number and location number in different place. > Writable external table with Hash distribution shows slow performance > - > > Key: HAWQ-1183 > URL: https://issues.apache.org/jira/browse/HAWQ-1183 > Project: Apache HAWQ > Issue Type: Bug >Reporter: Paul Guo >Assignee: Paul Guo > > Steps: > 1. Create tables and populate them. > drop table tbl1; > drop external table ext_tbl1; > drop external table ext_tbl1_random; > CREATE TABLE tbl1 (a int, b text) DISTRIBUTED BY (a); > INSERT INTO tbl1 VALUES (generate_series(1,1000),'aaa'); > INSERT INTO tbl1 VALUES (generate_series(1,1),'bbb'); > INSERT INTO tbl1 VALUES (generate_series(1,10),'bbc'); > INSERT INTO tbl1 VALUES (generate_series(1,100),'bdbc'); > INSERT INTO tbl1 VALUES (generate_series(1,100),'bdddbc'); > CREATE WRITABLE EXTERNAL TABLE ext_tbl1 > ( LIKE tbl1 ) > LOCATION ('gpfdist://127.0.0.1/tbl1.csv') > FORMAT 'CSV' (DELIMITER ',') > DISTRIBUTED BY (a); > CREATE WRITABLE EXTERNAL TABLE ext_tbl1_random > ( LIKE tbl1 ) > LOCATION ('gpfdist://127.0.0.1/tbl1.random.csv') > FORMAT 'CSV' (DELIMITER ',') > DISTRIBUTED RANDOMLY; > 2. Write the two external tables. We can find that the external table with > hash distribution is slow with inserting, and plan shows that it has 1 > workers only. > postgres=# explain analyze INSERT INTO ext_tbl1 SELECT * from tbl1; > QUERY PLAN > - > - > --- > Insert (cost=0.00..509.20 rows=1000 width=8) >Rows out: Avg 2111000.0 rows x 1 workers. > Max/Last(seg0:host67/seg0:host67) 2111000/2111000 rows with 70/70 ms to first > row, 20304/20304 ms to end, start offset by 30/30 ms. >Executor memory: 1K bytes. >-> Result (cost=0.00..431.07 rows=1000 width=20) > Rows out: Avg 2111000.0 rows x 1 workers. > Max/Last(seg0:host67/seg0:host67) 2111000/2111000 rows with 61/61 ms to first > row, 2034/2034 ms to end, start offset by 30/30 ms > . > -> Redistribute Motion 1:1 (slice1; segments: 1) > (cost=0.00..431.05 rows=1000 width=8) >Hash Key: tbl1.a >Rows out: Avg 2111000.0 rows x 1 workers at destination. > Max/Last(seg0:host67/seg0:host67) 2111000/2111000 rows with 61/61 ms to first > row, 1370/1370 ms to end, sta > rt offset by 30/30 ms. >-> Table Scan on tbl1 (cost=0.00..431.01 rows=1000 width=8) > Rows out: Avg 2111000.0 rows x 1 workers. > Max/Last(seg0:host67/seg0:host67) 2111000/2111000 rows with 61/61 ms to first > row, 566/566 ms to end, start offset b > y 30/30 ms. > Slice statistics: >(slice0)Executor memory: 293K bytes (seg0:host67). >(slice1)Executor memory: 303K bytes (seg0:host67). > Statement statistics: >Memory used: 262144K bytes > Optimizer status: PQO version 1.684 > Dispatcher statistics: >executors used(total/cached/new connection): (2/0/2); dispatcher > time(total/connection/dispatch data): (17.095 ms/16.477 ms/0.053