[jira] [Commented] (HAWQ-1183) Writable external table with Hash distribution shows slow performance

2016-12-01 Thread Paul Guo (JIRA)

[ https://issues.apache.org/jira/browse/HAWQ-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15711343#comment-15711343 ]

Paul Guo commented on HAWQ-1183:


With the previous patch, the plan and the runtime are now as expected.

postgres=# explain analyze INSERT INTO ext_tbl1 SELECT * from tbl1;
                                 QUERY PLAN
--------------------------------------------------------------------------------
 Insert  (cost=0.00..444.03 rows=167 width=8)
   Rows out:  Avg 351833.3 rows x 6 workers.  Max/Last(seg5:host67/seg0:host67) 351916/351849 rows with 89/171 ms to first row, 4074/4209 ms to end, start offset by 46/45 ms.
   Executor memory:  1K bytes avg, 1K bytes max (seg5:host67).
   ->  Result  (cost=0.00..431.01 rows=167 width=20)
         Rows out:  Avg 351833.3 rows x 6 workers.  Max/Last(seg5:host67/seg0:host67) 351916/351849 rows with 77/148 ms to first row, 292/392 ms to end, start offset by 46/45 ms.
         ->  Table Scan on tbl1  (cost=0.00..431.00 rows=167 width=8)
               Rows out:  Avg 351833.3 rows x 6 workers.  Max/Last(seg5:host67/seg2:host67) 351916/351855 rows with 77/152 ms to first row, 158/257 ms to end, start offset by 46/42 ms.
 Slice statistics:
   (slice0)    Executor memory: 280K bytes avg x 6 workers, 280K bytes max (seg5:host67).
 Statement statistics:
   Memory used: 262144K bytes
 Optimizer status: PQO version 1.684
 Dispatcher statistics:
   executors used(total/cached/new connection): (6/0/6); dispatcher time(total/connection/dispatch data): (38.288 ms/37.708 ms/0.078 ms).
   dispatch data time(max/min/avg): (0.028 ms/0.004 ms/0.012 ms); consume executor data time(max/min/avg): (0.067 ms/0.014 ms/0.029 ms); free executor time(max/min/avg): (0.000 ms/0.000 ms/0.000 ms).
 Data locality statistics:
   data locality ratio: 1.000; virtual segment number: 6; different host number: 1; virtual segment number per host(avg/min/max): (6/6/6); segment size(avg/min/max): (7670609.333 B/7668464 B/7672344 B); segment size with penalty(avg/min/max): (0.000 B/0 B/0 B); continuity(avg/min/max): (1.000/1.000/1.000); DFS metadatacache: 28.855 ms; resource allocation: 12.933 ms; datalocality calculation: 0.190 ms.
 Total runtime: 4333.663 ms
(18 rows)


> Writable external table with Hash distribution shows slow performance
> ----------------------------------------------------------------------
>
> Key: HAWQ-1183
> URL: https://issues.apache.org/jira/browse/HAWQ-1183
> Project: Apache HAWQ
>  Issue Type: Bug
>Reporter: Paul Guo
>Assignee: Paul Guo
>
> Steps:
> 1. Create tables and populate them.
> drop table tbl1;
> drop external table ext_tbl1;
> drop external table ext_tbl1_random;
> CREATE TABLE tbl1 (a int, b text) DISTRIBUTED BY (a);
> INSERT INTO tbl1 VALUES (generate_series(1,1000),'aaa');
> INSERT INTO tbl1 VALUES (generate_series(1,1),'bbb');
> INSERT INTO tbl1 VALUES (generate_series(1,10),'bbc');
> INSERT INTO tbl1 VALUES (generate_series(1,100),'bdbc');
> INSERT INTO tbl1 VALUES (generate_series(1,100),'bdddbc');
> CREATE WRITABLE EXTERNAL TABLE ext_tbl1
> ( LIKE tbl1 )
> LOCATION ('gpfdist://127.0.0.1/tbl1.csv')
> FORMAT 'CSV' (DELIMITER ',')
> DISTRIBUTED BY (a);
> CREATE WRITABLE EXTERNAL TABLE ext_tbl1_random
> ( LIKE tbl1 )
> LOCATION ('gpfdist://127.0.0.1/tbl1.random.csv')
> FORMAT 'CSV' (DELIMITER ',')
> DISTRIBUTED RANDOMLY;
> 2. Write to the two external tables. We can see that the external table with 
> hash distribution is slow when inserting, and the plan shows that it uses 
> only 1 worker.
> postgres=# explain analyze INSERT INTO ext_tbl1 SELECT * from tbl1;
>                                  QUERY PLAN
> --------------------------------------------------------------------------------
>  Insert  (cost=0.00..509.20 rows=1000 width=8)
>    Rows out:  Avg 2111000.0 rows x 1 workers.  Max/Last(seg0:host67/seg0:host67) 2111000/2111000 rows with 17/17 ms to first row, 20145/20145 ms to end, start offset by 18/18 ms.
>    Executor memory:  1K bytes.
>    ->  Result  (cost=0.00..431.07 rows=1000 width=20)
>          Rows out:  Avg 2111000.0 rows x 1 workers.  Max/Last(seg0:host67/seg0:host67) 2111000/2111000 rows with 14/14 ms to first row, 1919/1919 ms to end, start offset by 18/18 ms.
>          ->  Redistribute 

[jira] [Commented] (HAWQ-1183) Writable external table with Hash distribution shows slow performance

2016-12-01 Thread Paul Guo (JIRA)

[ https://issues.apache.org/jira/browse/HAWQ-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15711332#comment-15711332 ]

Paul Guo commented on HAWQ-1183:


16559 and 16561 are the OIDs of the two external tables, one hash-distributed 
and the other randomly distributed.

postgres=# select * from gp_distribution_policy;
 localoid | bucketnum | attrnums
----------+-----------+----------
    16554 |         6 | {1}
    16559 |         1 | {1}
    16561 |         1 |
(3 rows)
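
A quick way to read that output with relation names instead of OIDs (a hypothetical diagnostic query, not from the original run; it just joins gp_distribution_policy to pg_class on localoid):

SELECT c.relname, p.bucketnum, p.attrnums
FROM gp_distribution_policy p
JOIN pg_class c ON c.oid = p.localoid   -- localoid references pg_class.oid
ORDER BY c.relname;

With this, the hash-distributed writable external table should show up by name with bucketnum = 1, matching its single gpfdist location.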

Looking into DefineExternalRelation(), it looks like for EXTTBL_TYPE_LOCATION 
it sets the bucket number to the number of locations (gpfdist URIs in our case):

createStmt->policy->bucketnum = locLength;

I talked with the designer involved; this seems to be a hack. In theory we 
should store the location count and the bucket number in separate places in 
the catalog tables.

In the short term, we could fix this with the patch below:
@@ -970,7 +970,7 @@ DefineExternalRelation(CreateExternalStmt *createExtStmt)
                 isweb, iswritable,);
     if(!isCustom){
         int locLength = list_length(exttypeDesc->location_list);
-        if (createStmt->policy && locLength > 0)
+        if (createStmt->policy && locLength > 0 && locLength > createStmt->policy->bucketnum)
         {
             createStmt->policy->bucketnum = locLength;
         }

In the long run, we should store the bucket number and the location count in 
different places.

> Writable external table with Hash distribution shows slow performance
> ----------------------------------------------------------------------
>
> Key: HAWQ-1183
> URL: https://issues.apache.org/jira/browse/HAWQ-1183
> Project: Apache HAWQ
>  Issue Type: Bug
>Reporter: Paul Guo
>Assignee: Paul Guo
>
> Steps:
> 1. Create tables and populate them.
> drop table tbl1;
> drop external table ext_tbl1;
> drop external table ext_tbl1_random;
> CREATE TABLE tbl1 (a int, b text) DISTRIBUTED BY (a);
> INSERT INTO tbl1 VALUES (generate_series(1,1000),'aaa');
> INSERT INTO tbl1 VALUES (generate_series(1,1),'bbb');
> INSERT INTO tbl1 VALUES (generate_series(1,10),'bbc');
> INSERT INTO tbl1 VALUES (generate_series(1,100),'bdbc');
> INSERT INTO tbl1 VALUES (generate_series(1,100),'bdddbc');
> CREATE WRITABLE EXTERNAL TABLE ext_tbl1
> ( LIKE tbl1 )
> LOCATION ('gpfdist://127.0.0.1/tbl1.csv')
> FORMAT 'CSV' (DELIMITER ',')
> DISTRIBUTED BY (a);
> CREATE WRITABLE EXTERNAL TABLE ext_tbl1_random
> ( LIKE tbl1 )
> LOCATION ('gpfdist://127.0.0.1/tbl1.random.csv')
> FORMAT 'CSV' (DELIMITER ',')
> DISTRIBUTED RANDOMLY;
> 2. Write to the two external tables. We can see that the external table with 
> hash distribution is slow when inserting, and the plan shows that it uses 
> only 1 worker.
> postgres=# explain analyze INSERT INTO ext_tbl1 SELECT * from tbl1;
>                                  QUERY PLAN
> --------------------------------------------------------------------------------
>  Insert  (cost=0.00..509.20 rows=1000 width=8)
>    Rows out:  Avg 2111000.0 rows x 1 workers.  Max/Last(seg0:host67/seg0:host67) 2111000/2111000 rows with 70/70 ms to first row, 20304/20304 ms to end, start offset by 30/30 ms.
>    Executor memory:  1K bytes.
>    ->  Result  (cost=0.00..431.07 rows=1000 width=20)
>          Rows out:  Avg 2111000.0 rows x 1 workers.  Max/Last(seg0:host67/seg0:host67) 2111000/2111000 rows with 61/61 ms to first row, 2034/2034 ms to end, start offset by 30/30 ms.
>          ->  Redistribute Motion 1:1  (slice1; segments: 1)  (cost=0.00..431.05 rows=1000 width=8)
>                Hash Key: tbl1.a
>                Rows out:  Avg 2111000.0 rows x 1 workers at destination.  Max/Last(seg0:host67/seg0:host67) 2111000/2111000 rows with 61/61 ms to first row, 1370/1370 ms to end, start offset by 30/30 ms.
>                ->  Table Scan on tbl1  (cost=0.00..431.01 rows=1000 width=8)
>                      Rows out:  Avg 2111000.0 rows x 1 workers.  Max/Last(seg0:host67/seg0:host67) 2111000/2111000 rows with 61/61 ms to first row, 566/566 ms to end, start offset by 30/30 ms.
>  Slice statistics:
>    (slice0)    Executor memory: 293K bytes (seg0:host67).
>    (slice1)    Executor memory: 303K bytes (seg0:host67).
>  Statement statistics:
>    Memory used: 262144K bytes
>  Optimizer status: PQO version 1.684
>  Dispatcher statistics:
>    executors used(total/cached/new connection): (2/0/2); dispatcher time(total/connection/dispatch data): (17.095 ms/16.477 ms/0.053