[jira] [Assigned] (HAWQ-1342) QE process hang in shared input scan on segment node
[ https://issues.apache.org/jira/browse/HAWQ-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ming LI reassigned HAWQ-1342:
-----------------------------

    Assignee: Ming LI (was: Amy)

> QE process hang in shared input scan on segment node
>
>
> Key: HAWQ-1342
> URL: https://issues.apache.org/jira/browse/HAWQ-1342
> Project: Apache HAWQ
> Issue Type: Bug
> Components: Query Execution
> Affects Versions: 2.0.0.0-incubating
> Reporter: Amy
> Assignee: Ming LI
> Fix For: backlog
>
>
> QE processes hang on some segment nodes while the QD and the QEs on other segment nodes have terminated.
> {code}
> [gpadmin@test1 ~]$ cat hostfile
> test1 master secondary namenode
> test2 segment datanode
> test3 segment datanode
> test4 segment datanode
> test5 segment namenode
> [gpadmin@test3 ~]$ ps -ef | grep postgres | grep -v grep
> gpadmin 41877 1 0 05:35 ? 00:01:04 /usr/local/hawq_2_1_0_0/bin/postgres -D /data/pulse-agent-data/HAWQ-main-FeatureTest-opt-Multinode-parallel/product/segmentdd -i -M segment -p 20100 --silent-mode=true
> gpadmin 41878 41877 0 05:35 ? 00:00:02 postgres: port 20100, logger process
> gpadmin 41881 41877 0 05:35 ? 00:00:00 postgres: port 20100, stats collector process
> gpadmin 41882 41877 0 05:35 ? 00:00:07 postgres: port 20100, writer process
> gpadmin 41883 41877 0 05:35 ? 00:00:01 postgres: port 20100, checkpoint process
> gpadmin 41884 41877 0 05:35 ? 00:00:11 postgres: port 20100, segment resource manager
> gpadmin 42108 41877 0 05:35 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(65193) con35 seg0 cmd2 slice9 MPPEXEC SELECT
> gpadmin 42416 41877 0 05:35 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(65359) con53 seg0 cmd2 slice11 MPPEXEC SELECT
> gpadmin 44807 41877 0 05:36 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(2272) con183 seg0 cmd2 slice31 MPPEXEC SELECT
> gpadmin 44819 41877 0 05:36 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(2278) con183 seg0 cmd2 slice10 MPPEXEC SELECT
> gpadmin 44821 41877 0 05:36 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(2279) con183 seg0 cmd2 slice25 MPPEXEC SELECT
> gpadmin 45447 41877 0 05:36 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(2605) con207 seg0 cmd2 slice9 MPPEXEC SELECT
> gpadmin 49859 41877 0 05:38 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(4805) con432 seg0 cmd2 slice20 MPPEXEC SELECT
> gpadmin 49881 41877 0 05:38 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(4816) con432 seg0 cmd2 slice7 MPPEXEC SELECT
> gpadmin 51937 41877 0 05:39 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5877) con517 seg0 cmd2 slice7 MPPEXEC SELECT
> gpadmin 51939 41877 0 05:39 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5878) con517 seg0 cmd2 slice9 MPPEXEC SELECT
> gpadmin 51941 41877 0 05:39 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5879) con517 seg0 cmd2 slice11 MPPEXEC SELECT
> gpadmin 51943 41877 0 05:39 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5880) con517 seg0 cmd2 slice13 MPPEXEC SELECT
> gpadmin 51953 41877 0 05:39 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5885) con517 seg0 cmd2 slice26 MPPEXEC SELECT
> gpadmin 53436 41877 0 05:40 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(6634) con602 seg0 cmd2 slice15 MPPEXEC SELECT
> gpadmin 57095 41877 0 05:41 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(8450) con782 seg0 cmd2 slice10 MPPEXEC SELECT
> gpadmin 57097 41877 0 05:41 ? 00:00:04 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(8451) con782 seg0 cmd2 slice11 MPPEXEC SELECT
> gpadmin 63159 41877 0 05:43 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(11474) con1082 seg0 cmd2 slice15 MPPEXEC SELECT
> gpadmin 64018 41877 0 05:44 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(11905) con1121 seg0 cmd2 slice5 MPPEXEC SELECT
> {code}
> The stack info is below; it seems that the QE hangs in shared input scan.
> {code}
> [gpadmin@test3 ~]$ gdb -p 42108
> (gdb) info threads
>   2 Thread 0x7f4f6b335700 (LWP 42109) 0x0032214df283 in poll () from /lib64/libc.so.6
> * 1 Thread 0x7f4f9041c920 (LWP 42108) 0x0032214e1523 in sele
[jira] [Assigned] (HAWQ-1342) QE process hang in shared input scan on segment node
[ https://issues.apache.org/jira/browse/HAWQ-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amy reassigned HAWQ-1342:
-------------------------

    Assignee: Amy (was: Lei Chang)

> QE process hang in shared input scan on segment node
>
>
> Key: HAWQ-1342
> URL: https://issues.apache.org/jira/browse/HAWQ-1342
> Project: Apache HAWQ
> Issue Type: Bug
> Components: Query Execution
> Reporter: Amy
> Assignee: Amy
> Fix For: 2.3.0.0-incubating
>
>
> QE processes hang on some segment nodes while the QD and the QEs on other segment nodes have terminated.
> {code}
> [gpadmin@test1 ~]$ cat hostfile
> test1 master secondary namenode
> test2 segment datanode
> test3 segment datanode
> test4 segment datanode
> test5 segment namenode
> [gpadmin@test3 ~]$ ps -ef | grep postgres | grep -v grep
> gpadmin 41877 1 0 05:35 ? 00:01:04 /usr/local/hawq_2_1_0_0/bin/postgres -D /data/pulse-agent-data/HAWQ-main-FeatureTest-opt-Multinode-parallel/product/segmentdd -i -M segment -p 20100 --silent-mode=true
> gpadmin 41878 41877 0 05:35 ? 00:00:02 postgres: port 20100, logger process
> gpadmin 41881 41877 0 05:35 ? 00:00:00 postgres: port 20100, stats collector process
> gpadmin 41882 41877 0 05:35 ? 00:00:07 postgres: port 20100, writer process
> gpadmin 41883 41877 0 05:35 ? 00:00:01 postgres: port 20100, checkpoint process
> gpadmin 41884 41877 0 05:35 ? 00:00:11 postgres: port 20100, segment resource manager
> gpadmin 42108 41877 0 05:35 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(65193) con35 seg0 cmd2 slice9 MPPEXEC SELECT
> gpadmin 42416 41877 0 05:35 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(65359) con53 seg0 cmd2 slice11 MPPEXEC SELECT
> gpadmin 44807 41877 0 05:36 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(2272) con183 seg0 cmd2 slice31 MPPEXEC SELECT
> gpadmin 44819 41877 0 05:36 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(2278) con183 seg0 cmd2 slice10 MPPEXEC SELECT
> gpadmin 44821 41877 0 05:36 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(2279) con183 seg0 cmd2 slice25 MPPEXEC SELECT
> gpadmin 45447 41877 0 05:36 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(2605) con207 seg0 cmd2 slice9 MPPEXEC SELECT
> gpadmin 49859 41877 0 05:38 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(4805) con432 seg0 cmd2 slice20 MPPEXEC SELECT
> gpadmin 49881 41877 0 05:38 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(4816) con432 seg0 cmd2 slice7 MPPEXEC SELECT
> gpadmin 51937 41877 0 05:39 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5877) con517 seg0 cmd2 slice7 MPPEXEC SELECT
> gpadmin 51939 41877 0 05:39 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5878) con517 seg0 cmd2 slice9 MPPEXEC SELECT
> gpadmin 51941 41877 0 05:39 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5879) con517 seg0 cmd2 slice11 MPPEXEC SELECT
> gpadmin 51943 41877 0 05:39 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5880) con517 seg0 cmd2 slice13 MPPEXEC SELECT
> gpadmin 51953 41877 0 05:39 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5885) con517 seg0 cmd2 slice26 MPPEXEC SELECT
> gpadmin 53436 41877 0 05:40 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(6634) con602 seg0 cmd2 slice15 MPPEXEC SELECT
> gpadmin 57095 41877 0 05:41 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(8450) con782 seg0 cmd2 slice10 MPPEXEC SELECT
> gpadmin 57097 41877 0 05:41 ? 00:00:04 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(8451) con782 seg0 cmd2 slice11 MPPEXEC SELECT
> gpadmin 63159 41877 0 05:43 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(11474) con1082 seg0 cmd2 slice15 MPPEXEC SELECT
> gpadmin 64018 41877 0 05:44 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(11905) con1121 seg0 cmd2 slice5 MPPEXEC SELECT
> {code}
> The stack info is below; it seems that the QE hangs in shared input scan.
> {code}
> [gpadmin@test3 ~]$ gdb -p 42108
> (gdb) info threads
>   2 Thread 0x7f4f6b335700 (LWP 42109) 0x0032214df283 in poll () from /lib64/libc.so.6
> * 1 Thread 0x7f4f9041c920 (LWP 42108) 0x0032214e1523 in select () from /lib64/libc.so.6
> (gdb) th