[jira] [Assigned] (HAWQ-1342) QE process hang in shared input scan on segment node

2017-02-22 Thread Ming LI (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming LI reassigned HAWQ-1342:
-

Assignee: Ming LI  (was: Amy)

> QE process hang in shared input scan on segment node
> 
>
> Key: HAWQ-1342
> URL: https://issues.apache.org/jira/browse/HAWQ-1342
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Query Execution
>Affects Versions: 2.0.0.0-incubating
>Reporter: Amy
>Assignee: Ming LI
> Fix For: backlog
>
>
> QE process hang on some segment node while QD and QE on other segment nodes 
> terminated.
> {code}
> [gpadmin@test1 ~]$ cat hostfile
> test1   master   secondary namenode
> test2   segment   datanode
> test3   segment   datanode
> test4   segment   datanode
> test5   segment   namenode
> [gpadmin@test3 ~]$ ps -ef | grep postgres | grep -v grep
> gpadmin   41877  1  0 05:35 ?00:01:04 
> /usr/local/hawq_2_1_0_0/bin/postgres -D 
> /data/pulse-agent-data/HAWQ-main-FeatureTest-opt-Multinode-parallel/product/segmentdd
>  -i -M segment -p 20100 --silent-mode=true
> gpadmin   41878  41877  0 05:35 ?00:00:02 postgres: port 20100, 
> logger process
> gpadmin   41881  41877  0 05:35 ?00:00:00 postgres: port 20100, stats 
> collector process
> gpadmin   41882  41877  0 05:35 ?00:00:07 postgres: port 20100, 
> writer process
> gpadmin   41883  41877  0 05:35 ?00:00:01 postgres: port 20100, 
> checkpoint process
> gpadmin   41884  41877  0 05:35 ?00:00:11 postgres: port 20100, 
> segment resource manager
> gpadmin   42108  41877  0 05:35 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(65193) con35 seg0 cmd2 slice9 MPPEXEC 
> SELECT
> gpadmin   42416  41877  0 05:35 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(65359) con53 seg0 cmd2 slice11 MPPEXEC 
> SELECT
> gpadmin   44807  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2272) con183 seg0 cmd2 slice31 MPPEXEC 
> SELECT
> gpadmin   44819  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2278) con183 seg0 cmd2 slice10 MPPEXEC 
> SELECT
> gpadmin   44821  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2279) con183 seg0 cmd2 slice25 MPPEXEC 
> SELECT
> gpadmin   45447  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2605) con207 seg0 cmd2 slice9 MPPEXEC 
> SELECT
> gpadmin   49859  41877  0 05:38 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(4805) con432 seg0 cmd2 slice20 MPPEXEC 
> SELECT
> gpadmin   49881  41877  0 05:38 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(4816) con432 seg0 cmd2 slice7 MPPEXEC 
> SELECT
> gpadmin   51937  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5877) con517 seg0 cmd2 slice7 MPPEXEC 
> SELECT
> gpadmin   51939  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5878) con517 seg0 cmd2 slice9 MPPEXEC 
> SELECT
> gpadmin   51941  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5879) con517 seg0 cmd2 slice11 MPPEXEC 
> SELECT
> gpadmin   51943  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5880) con517 seg0 cmd2 slice13 MPPEXEC 
> SELECT
> gpadmin   51953  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5885) con517 seg0 cmd2 slice26 MPPEXEC 
> SELECT
> gpadmin   53436  41877  0 05:40 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(6634) con602 seg0 cmd2 slice15 MPPEXEC 
> SELECT
> gpadmin   57095  41877  0 05:41 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(8450) con782 seg0 cmd2 slice10 MPPEXEC 
> SELECT
> gpadmin   57097  41877  0 05:41 ?00:00:04 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(8451) con782 seg0 cmd2 slice11 MPPEXEC 
> SELECT
> gpadmin   63159  41877  0 05:43 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(11474) con1082 seg0 cmd2 slice15 
> MPPEXEC SELECT
> gpadmin   64018  41877  0 05:44 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(11905) con1121 seg0 cmd2 slice5 MPPEXEC 
> SELECT
> {code}
> The stack info is as below and it seems that QE hang in shared input scan.
> {code}
> [gpadmin@test3 ~]$ gdb -p 42108
> (gdb) info threads
>   2 Thread 0x7f4f6b335700 (LWP 42109)  0x0032214df283 in poll () from 
> /lib64/libc.so.6
> * 1 Thread 0x7f4f9041c920 (LWP 42108)  0x0032214e1523 in sele

[jira] [Assigned] (HAWQ-1342) QE process hang in shared input scan on segment node

2017-02-17 Thread Amy (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amy reassigned HAWQ-1342:
-

Assignee: Amy  (was: Lei Chang)

> QE process hang in shared input scan on segment node
> 
>
> Key: HAWQ-1342
> URL: https://issues.apache.org/jira/browse/HAWQ-1342
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Query Execution
>Reporter: Amy
>Assignee: Amy
> Fix For: 2.3.0.0-incubating
>
>
> QE process hang on some segment node while QD and QE on other segment nodes 
> terminated.
> {code}
> [gpadmin@test1 ~]$ cat hostfile
> test1   master   secondary namenode
> test2   segment   datanode
> test3   segment   datanode
> test4   segment   datanode
> test5   segment   namenode
> [gpadmin@test3 ~]$ ps -ef | grep postgres | grep -v grep
> gpadmin   41877  1  0 05:35 ?00:01:04 
> /usr/local/hawq_2_1_0_0/bin/postgres -D 
> /data/pulse-agent-data/HAWQ-main-FeatureTest-opt-Multinode-parallel/product/segmentdd
>  -i -M segment -p 20100 --silent-mode=true
> gpadmin   41878  41877  0 05:35 ?00:00:02 postgres: port 20100, 
> logger process
> gpadmin   41881  41877  0 05:35 ?00:00:00 postgres: port 20100, stats 
> collector process
> gpadmin   41882  41877  0 05:35 ?00:00:07 postgres: port 20100, 
> writer process
> gpadmin   41883  41877  0 05:35 ?00:00:01 postgres: port 20100, 
> checkpoint process
> gpadmin   41884  41877  0 05:35 ?00:00:11 postgres: port 20100, 
> segment resource manager
> gpadmin   42108  41877  0 05:35 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(65193) con35 seg0 cmd2 slice9 MPPEXEC 
> SELECT
> gpadmin   42416  41877  0 05:35 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(65359) con53 seg0 cmd2 slice11 MPPEXEC 
> SELECT
> gpadmin   44807  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2272) con183 seg0 cmd2 slice31 MPPEXEC 
> SELECT
> gpadmin   44819  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2278) con183 seg0 cmd2 slice10 MPPEXEC 
> SELECT
> gpadmin   44821  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2279) con183 seg0 cmd2 slice25 MPPEXEC 
> SELECT
> gpadmin   45447  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2605) con207 seg0 cmd2 slice9 MPPEXEC 
> SELECT
> gpadmin   49859  41877  0 05:38 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(4805) con432 seg0 cmd2 slice20 MPPEXEC 
> SELECT
> gpadmin   49881  41877  0 05:38 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(4816) con432 seg0 cmd2 slice7 MPPEXEC 
> SELECT
> gpadmin   51937  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5877) con517 seg0 cmd2 slice7 MPPEXEC 
> SELECT
> gpadmin   51939  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5878) con517 seg0 cmd2 slice9 MPPEXEC 
> SELECT
> gpadmin   51941  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5879) con517 seg0 cmd2 slice11 MPPEXEC 
> SELECT
> gpadmin   51943  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5880) con517 seg0 cmd2 slice13 MPPEXEC 
> SELECT
> gpadmin   51953  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5885) con517 seg0 cmd2 slice26 MPPEXEC 
> SELECT
> gpadmin   53436  41877  0 05:40 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(6634) con602 seg0 cmd2 slice15 MPPEXEC 
> SELECT
> gpadmin   57095  41877  0 05:41 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(8450) con782 seg0 cmd2 slice10 MPPEXEC 
> SELECT
> gpadmin   57097  41877  0 05:41 ?00:00:04 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(8451) con782 seg0 cmd2 slice11 MPPEXEC 
> SELECT
> gpadmin   63159  41877  0 05:43 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(11474) con1082 seg0 cmd2 slice15 
> MPPEXEC SELECT
> gpadmin   64018  41877  0 05:44 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(11905) con1121 seg0 cmd2 slice5 MPPEXEC 
> SELECT
> {code}
> The stack info is as below and it seems that QE hang in shared input scan.
> {code}
> [gpadmin@test3 ~]$ gdb -p 42108
> (gdb) info threads
>   2 Thread 0x7f4f6b335700 (LWP 42109)  0x0032214df283 in poll () from 
> /lib64/libc.so.6
> * 1 Thread 0x7f4f9041c920 (LWP 42108)  0x0032214e1523 in select () from 
> /lib64/libc.so.6
> (gdb) th