[jira] [Commented] (HAWQ-1342) QE process hang in shared input scan on segment node
[ https://issues.apache.org/jira/browse/HAWQ-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880029#comment-15880029 ]

Ming LI commented on HAWQ-1342:
---
Some may have questions about how select() error returns are handled, so here is a summary.

The current handling of select() errors is:
- On both systems:
  EBADF -- break out of the loop
  EINTR -- loop again
  EINVAL -- programming error, should not occur
- On Linux:
  ENOMEM -- loop again, waiting for the runaway mechanism to choose one transaction to roll back, or for the OS to choose one process to kill
- On macOS:
  EAGAIN -- loop again

Conclusion:
---
So we only handle EBADF explicitly; the other errors either cause the loop to run again or should not occur.

Thanks.

> QE process hang in shared input scan on segment node
>
> Key: HAWQ-1342
> URL: https://issues.apache.org/jira/browse/HAWQ-1342
> Project: Apache HAWQ
> Issue Type: Bug
> Components: Query Execution
> Affects Versions: 2.0.0.0-incubating
> Reporter: Amy
> Assignee: Ming LI
> Fix For: backlog
>
> QE processes hang on some segment nodes while the QD and the QEs on other segment nodes have terminated.
> {code}
> [gpadmin@test1 ~]$ cat hostfile
> test1 master secondary namenode
> test2 segment datanode
> test3 segment datanode
> test4 segment datanode
> test5 segment namenode
> [gpadmin@test3 ~]$ ps -ef | grep postgres | grep -v grep
> gpadmin 41877 1 0 05:35 ? 00:01:04 /usr/local/hawq_2_1_0_0/bin/postgres -D /data/pulse-agent-data/HAWQ-main-FeatureTest-opt-Multinode-parallel/product/segmentdd -i -M segment -p 20100 --silent-mode=true
> gpadmin 41878 41877 0 05:35 ? 00:00:02 postgres: port 20100, logger process
> gpadmin 41881 41877 0 05:35 ? 00:00:00 postgres: port 20100, stats collector process
> gpadmin 41882 41877 0 05:35 ? 00:00:07 postgres: port 20100, writer process
> gpadmin 41883 41877 0 05:35 ? 00:00:01 postgres: port 20100, checkpoint process
> gpadmin 41884 41877 0 05:35 ? 00:00:11 postgres: port 20100, segment resource manager
> gpadmin 42108 41877 0 05:35 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(65193) con35 seg0 cmd2 slice9 MPPEXEC SELECT
> gpadmin 42416 41877 0 05:35 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(65359) con53 seg0 cmd2 slice11 MPPEXEC SELECT
> gpadmin 44807 41877 0 05:36 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(2272) con183 seg0 cmd2 slice31 MPPEXEC SELECT
> gpadmin 44819 41877 0 05:36 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(2278) con183 seg0 cmd2 slice10 MPPEXEC SELECT
> gpadmin 44821 41877 0 05:36 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(2279) con183 seg0 cmd2 slice25 MPPEXEC SELECT
> gpadmin 45447 41877 0 05:36 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(2605) con207 seg0 cmd2 slice9 MPPEXEC SELECT
> gpadmin 49859 41877 0 05:38 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(4805) con432 seg0 cmd2 slice20 MPPEXEC SELECT
> gpadmin 49881 41877 0 05:38 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(4816) con432 seg0 cmd2 slice7 MPPEXEC SELECT
> gpadmin 51937 41877 0 05:39 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5877) con517 seg0 cmd2 slice7 MPPEXEC SELECT
> gpadmin 51939 41877 0 05:39 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5878) con517 seg0 cmd2 slice9 MPPEXEC SELECT
> gpadmin 51941 41877 0 05:39 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5879) con517 seg0 cmd2 slice11 MPPEXEC SELECT
> gpadmin 51943 41877 0 05:39 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5880) con517 seg0 cmd2 slice13 MPPEXEC SELECT
> gpadmin 51953 41877 0 05:39 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5885) con517 seg0 cmd2 slice26 MPPEXEC SELECT
> gpadmin 53436 41877 0 05:40 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(6634) con602 seg0 cmd2 slice15 MPPEXEC SELECT
> gpadmin 57095 41877 0 05:41 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(8450) con782 seg0 cmd2 slice10 MPPEXEC SELECT
> gpadmin 57097 41877 0 05:41 ? 00:00:04 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(8451) con782 seg0 cmd2 slice11 MPPEXEC SELECT
> gpadmin 63159 41877 0 05:43 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.
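
The select() error summary in Ming LI's comment above could be sketched roughly as follows. This is only an illustration of the errno classification, not the HAWQ source: the function name `wait_readable`, its return convention, and the one-second timeout are all assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the HAWQ source): wait until fd is readable.
 * Returns  1 when fd is readable,
 *          0 when select() fails with EBADF (fd already closed),
 *         -1 on an error the summary treats as a programming bug (e.g. EINVAL).
 */
static int wait_readable(int fd)
{
    /* select() can only watch descriptors below FD_SETSIZE. */
    assert(fd >= 0 && fd < FD_SETSIZE);

    for (;;) {
        fd_set rset;
        /* Rebuilt on every pass: select() may modify both the set and the timeout. */
        struct timeval tv = {1, 0};

        FD_ZERO(&rset);
        FD_SET(fd, &rset);

        int n = select(fd + 1, &rset, NULL, NULL, &tv);
        if (n > 0)
            return 1;           /* fd is readable */
        if (n == 0)
            continue;           /* timeout: wait again */

        switch (errno) {
        case EBADF:             /* fd was closed: the only case that ends the loop */
            return 0;
        case EINTR:             /* interrupted by a signal */
        case ENOMEM:            /* Linux: memory pressure, retry */
        case EAGAIN:            /* macOS: temporary resource shortage, retry */
            continue;
        default:                /* EINVAL and anything else: should not occur */
            return -1;
        }
    }
}
```

A caller would treat a return of 0 as "the peer's transaction has ended and its descriptors are gone", which matches the conclusion that only EBADF needs explicit handling.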
[jira] [Commented] (HAWQ-1342) QE process hang in shared input scan on segment node
[ https://issues.apache.org/jira/browse/HAWQ-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879968#comment-15879968 ]

Ming LI commented on HAWQ-1342:
---
The basic idea for fixing this kind of hang is:
(1) The segment that throws the error rolls back the whole transaction, and all related fds are closed at transaction end.
(2) The other segments behave as before: while waiting in select(), they loop until the specific fd is closed, and the code then keeps running until the interrupt is processed again at some later point (the transaction rollback sends a cancel signal).

So some previous fixes (HAWQ-166, HAWQ-1282) will be changed accordingly:
(1) HAWQ-166: we don't need to skip sending info.
(2) HAWQ-1282:
- we don't need to close the fd; it will be closed automatically at transaction end.
- we just end the loop if we find that the related fd has already been closed.

> QE process hang in shared input scan on segment node
>
> Key: HAWQ-1342
> URL: https://issues.apache.org/jira/browse/HAWQ-1342
> Project: Apache HAWQ
> Issue Type: Bug
> Components: Query Execution
> Affects Versions: 2.0.0.0-incubating
> Reporter: Amy
> Assignee: Ming LI
> Fix For: backlog
>
> QE processes hang on some segment nodes while the QD and the QEs on other segment nodes have terminated.
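
The "end the loop if the related fd has already been closed" check from the comment above could be expressed with a small helper like the one below. This is a sketch under assumptions: the name `fd_is_closed` is hypothetical, and using `fcntl(F_GETFD)` is one common way to probe a descriptor, not necessarily what the HAWQ fix does.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the HAWQ source): report whether fd no longer
 * refers to an open descriptor in this process. fcntl(F_GETFD) fails with
 * EBADF for a closed descriptor and has no side effects on an open one.
 */
static bool fd_is_closed(int fd)
{
    assert(fd >= 0);
    return fcntl(fd, F_GETFD) == -1 && errno == EBADF;
}
```

A waiting loop could call such a helper after each select() timeout and stop once it returns true, leaving the actual close to transaction cleanup. Note that descriptor numbers can be reused, so the probe is only meaningful while nothing else in the process opens new descriptors in between.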