[jira] [Commented] (HAWQ-1342) QE process hang in shared input scan on segment node

2017-02-22 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880029#comment-15880029
 ] 

Ming LI commented on HAWQ-1342:
---

Some readers may have questions about how select() error returns are handled;
below is a summary.

The current handling of select() errors is:
- On both systems:
EBADF  -- break
EINTR   -- loop again
EINVAL -- programming error, should not occur

- On Linux:
ENOMEM -- loop again, waiting for the runaway mechanism to choose a transaction
to roll back, or for the OS to choose a process to kill

- On macOS:
EAGAIN -- loop again

Conclusion: 
---
So we only handle EBADF explicitly; the other errors either loop again or
should never occur. Thanks.
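
As an illustration of the summary above, here is a minimal sketch in C of how
the errno values could map to actions in a select() wait loop. It is not the
actual HAWQ code: the names classify_select_errno, select_err_action and
wait_readable, the 1 second poll interval, and the choice to report EINVAL as
a separate "bug" outcome instead of looping are all assumptions made for this
example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical names; not taken from the HAWQ source. */
typedef enum {
    SELECT_ERR_STOP,   /* EBADF: the fd was closed, leave the wait loop */
    SELECT_ERR_RETRY,  /* EINTR, ENOMEM, EAGAIN: call select() again */
    SELECT_ERR_BUG     /* EINVAL or anything unexpected: programming error */
} select_err_action;

select_err_action classify_select_errno(int err)
{
    switch (err) {
    case EBADF:
        return SELECT_ERR_STOP;
    case EINTR:   /* interrupted by a signal */
    case ENOMEM:  /* Linux: wait for memory to be freed */
    case EAGAIN:  /* macOS: temporary resource shortage */
        return SELECT_ERR_RETRY;
    default:
        return SELECT_ERR_BUG;
    }
}

/* Returns 1 when fd is readable, 0 when fd was closed, -1 on a bug. */
int wait_readable(int fd)
{
    assert(fd >= 0 && fd < FD_SETSIZE);  /* precondition of FD_SET */

    for (;;) {
        fd_set rset;
        struct timeval tv = { 1, 0 };    /* assumed 1 s poll interval */
        int n;

        /* select() may modify both arguments, so rebuild them each pass. */
        FD_ZERO(&rset);
        FD_SET(fd, &rset);

        n = select(fd + 1, &rset, NULL, NULL, &tv);
        if (n > 0)
            return 1;
        if (n == 0)
            continue;                    /* timeout: poll again */

        switch (classify_select_errno(errno)) {
        case SELECT_ERR_STOP:
            return 0;
        case SELECT_ERR_RETRY:
            continue;                    /* restarts the for loop */
        case SELECT_ERR_BUG:
            return -1;
        }
    }
}
```

Keeping the errno mapping in its own function makes the policy easy to check
without opening any file descriptors.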


> QE process hang in shared input scan on segment node
> 
>
> Key: HAWQ-1342
> URL: https://issues.apache.org/jira/browse/HAWQ-1342
> Project: Apache HAWQ
> Issue Type: Bug
> Components: Query Execution
> Affects Versions: 2.0.0.0-incubating
> Reporter: Amy
> Assignee: Ming LI
> Fix For: backlog
>
>
> QE processes hang on some segment nodes while the QD and the QEs on other
> segment nodes have terminated.
> {code}
> [gpadmin@test1 ~]$ cat hostfile
> test1   master   secondary namenode
> test2   segment   datanode
> test3   segment   datanode
> test4   segment   datanode
> test5   segment   namenode
> [gpadmin@test3 ~]$ ps -ef | grep postgres | grep -v grep
> gpadmin   41877  1  0 05:35 ?00:01:04 
> /usr/local/hawq_2_1_0_0/bin/postgres -D 
> /data/pulse-agent-data/HAWQ-main-FeatureTest-opt-Multinode-parallel/product/segmentdd
>  -i -M segment -p 20100 --silent-mode=true
> gpadmin   41878  41877  0 05:35 ?00:00:02 postgres: port 20100, 
> logger process
> gpadmin   41881  41877  0 05:35 ?00:00:00 postgres: port 20100, stats 
> collector process
> gpadmin   41882  41877  0 05:35 ?00:00:07 postgres: port 20100, 
> writer process
> gpadmin   41883  41877  0 05:35 ?00:00:01 postgres: port 20100, 
> checkpoint process
> gpadmin   41884  41877  0 05:35 ?00:00:11 postgres: port 20100, 
> segment resource manager
> gpadmin   42108  41877  0 05:35 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(65193) con35 seg0 cmd2 slice9 MPPEXEC 
> SELECT
> gpadmin   42416  41877  0 05:35 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(65359) con53 seg0 cmd2 slice11 MPPEXEC 
> SELECT
> gpadmin   44807  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2272) con183 seg0 cmd2 slice31 MPPEXEC 
> SELECT
> gpadmin   44819  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2278) con183 seg0 cmd2 slice10 MPPEXEC 
> SELECT
> gpadmin   44821  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2279) con183 seg0 cmd2 slice25 MPPEXEC 
> SELECT
> gpadmin   45447  41877  0 05:36 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(2605) con207 seg0 cmd2 slice9 MPPEXEC 
> SELECT
> gpadmin   49859  41877  0 05:38 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(4805) con432 seg0 cmd2 slice20 MPPEXEC 
> SELECT
> gpadmin   49881  41877  0 05:38 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(4816) con432 seg0 cmd2 slice7 MPPEXEC 
> SELECT
> gpadmin   51937  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5877) con517 seg0 cmd2 slice7 MPPEXEC 
> SELECT
> gpadmin   51939  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5878) con517 seg0 cmd2 slice9 MPPEXEC 
> SELECT
> gpadmin   51941  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5879) con517 seg0 cmd2 slice11 MPPEXEC 
> SELECT
> gpadmin   51943  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5880) con517 seg0 cmd2 slice13 MPPEXEC 
> SELECT
> gpadmin   51953  41877  0 05:39 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(5885) con517 seg0 cmd2 slice26 MPPEXEC 
> SELECT
> gpadmin   53436  41877  0 05:40 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(6634) con602 seg0 cmd2 slice15 MPPEXEC 
> SELECT
> gpadmin   57095  41877  0 05:41 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(8450) con782 seg0 cmd2 slice10 MPPEXEC 
> SELECT
> gpadmin   57097  41877  0 05:41 ?00:00:04 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.192(8451) con782 seg0 cmd2 slice11 MPPEXEC 
> SELECT
> gpadmin   63159  41877  0 05:43 ?00:00:03 postgres: port 20100, 
> hawqsuperuser olap_group 10.32.35.

[jira] [Commented] (HAWQ-1342) QE process hang in shared input scan on segment node

2017-02-22 Thread Ming LI (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879968#comment-15879968
 ] 

Ming LI commented on HAWQ-1342:
---

The basic idea for fixing this kind of hang is:
(1) The segment that throws the error rolls back the whole transaction, and
all related fds are closed at transaction end.
(2) The other segments behave as before: while waiting in select(), they loop
until the specific fd is closed. The code then keeps running until it handles
the interrupt at some later point (the rolled-back transaction sends a cancel
signal).

So some previous fixes (HAWQ-166, HAWQ-1282) will be changed accordingly:
(1) HAWQ-166: we don't need to skip sending info.
(2) HAWQ-1282:
  - we don't need to close the fd; it is closed automatically at transaction
end.
  - we just end the loop if we find that the related fd has already been
closed.
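
One way to test whether a descriptor has already been closed, sketched here
in C, is to probe it with fcntl(F_GETFD), which fails with EBADF for a
descriptor that is not open. This is only an illustration: the comment above
does not say how the check is done, the helper name fd_is_closed is
hypothetical, and the actual change may simply rely on select() returning
EBADF.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>

/*
 * Hypothetical helper: returns true if fd does not refer to an open
 * descriptor in this process. F_GETFD only reads the descriptor flags,
 * so the probe has no side effects on an open fd.
 */
bool fd_is_closed(int fd)
{
    return fcntl(fd, F_GETFD) == -1 && errno == EBADF;
}
```

Note that descriptor numbers are reused, so a probe like this is only
meaningful if nothing else in the process can open a new descriptor with the
same number in the meantime.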
