[jira] [Updated] (HAWQ-1487) hang process due to deadlock when it try to process interrupt in error handling

2017-06-15 Thread Ruilong Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruilong Huo updated HAWQ-1487:
--
Affects Version/s: 2.2.0.0-incubating

> hang process due to deadlock when it try to process interrupt in error 
> handling
> ---
>
> Key: HAWQ-1487
> URL: https://issues.apache.org/jira/browse/HAWQ-1487
> Project: Apache HAWQ
> Issue Type: Bug
> Components: Query Execution
> Affects Versions: 2.2.0.0-incubating
> Reporter: Ruilong Huo
> Assignee: Ruilong Huo
> Fix For: 2.3.0.0-incubating
>
>
> A process hangs when it tries to process an interrupt during error handling. 
> Specifically, a QE (query executor) hits a division-by-zero error and starts 
> to error out. While handling that error, it tries to process a query-cancel 
> interrupt, and a deadlock occurs.
> The hung process is:
> {noformat}
> $ hawq ssh -f hostfile -e "ps -ef | grep postgres | grep -v grep"
> gpadmin   51246  51245  0 06:15 ?        00:00:01 postgres: port 20100, logger p
> gpadmin   51249  51245  0 06:15 ?        00:00:00 postgres: port 20100, stats co
> gpadmin   51250  51245  0 06:15 ?        00:00:07 postgres: port 20100, writer p
> gpadmin   51251  51245  0 06:15 ?        00:00:01 postgres: port 20100, checkpoi
> gpadmin   51252  51245  0 06:15 ?        00:00:11 postgres: port 20100, segment
> gpadmin  182983  51245  0 07:00 ?        00:00:03 postgres: port 20100, hawqsupe
> $ ps -ef | grep postgres | grep -v grep
> gpadmin   51245      1  0 06:15 ?        00:01:01 /usr/local/hawq_2_2_0_0/bin/postgres -D /data/pulse-agent-data/HAWQ-main-FeatureTest-opt-Multinode-parallel/product/segmentdd -i -M segment -p 20100 --silent-mode=true
> gpadmin   51246  51245  0 06:15 ?        00:00:01 postgres: port 20100, logger process
> gpadmin   51249  51245  0 06:15 ?        00:00:00 postgres: port 20100, stats collector process
> gpadmin   51250  51245  0 06:15 ?        00:00:07 postgres: port 20100, writer process
> gpadmin   51251  51245  0 06:15 ?        00:00:01 postgres: port 20100, checkpoint process
> gpadmin   51252  51245  0 06:15 ?        00:00:11 postgres: port 20100, segment resource manager
> gpadmin  182983  51245  0 07:00 ?        00:00:03 postgres: port 20100, hawqsuperuser olap_winow... 10.32.34.225(45462) con4405 seg0 cmd2 slice7 MPPEXEC SELECT
> gpadmin  194424 194402  0 23:50 pts/0    00:00:00 grep postgres
> {noformat}
> The call stack is:
> {noformat}
> $ sudo gdb -p 182983
> (gdb) bt
> #0  0x003ff060e2e4 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x003ff0609588 in _L_lock_854 () from /lib64/libpthread.so.0
> #2  0x003ff0609457 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x003ff221206a in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
> #4  0x003ff220f603 in ?? () from /lib64/libgcc_s.so.1
> #5  0x003ff220ff49 in ?? () from /lib64/libgcc_s.so.1
> #6  0x003ff22100e7 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
> #7  0x003ff02fe966 in backtrace () from /lib64/libc.so.6
> #8  0x009cda3f in errstart (elevel=20, filename=0xd309e0 
> "postgres.c", lineno=3618,
> funcname=0xd32fc0 "ProcessInterrupts", domain=0x0) at elog.c:492
> #9  0x008e8fcb in ProcessInterrupts () at postgres.c:3616
> #10 0x008e8c9e in StatementCancelHandler (postgres_signal_arg=2) at 
> postgres.c:3463
> #11 <signal handler called>
> #12 0x003ff0609451 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #13 0x003ff221206a in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
> #14 0x003ff220f603 in ?? () from /lib64/libgcc_s.so.1
> #15 0x003ff2210119 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
> #16 0x003ff02fe966 in backtrace () from /lib64/libc.so.6
> #17 0x009cda3f in errstart (elevel=20, filename=0xd3ba00 "float.c", 
> lineno=839, funcname=0xd3bf3a "float8div",
> domain=0x0) at elog.c:492
> #18 0x00921a84 in float8div (fcinfo=0x7ffd04d2b8b0) at float.c:836
> #19 0x00722fe5 in ExecMakeFunctionResult (fcache=0x324a088, 
> econtext=0x32495d8, isNull=0x7ffd04d2c0e0 "\030",
> isDone=0x7ffd04d2bd04) at execQual.c:1762
> #20 0x00723d87 in ExecEvalOper (fcache=0x324a088, econtext=0x32495d8, 
> isNull=0x7ffd04d2c0e0 "\030",
> isDone=0x7ffd04d2bd04) at execQual.c:2250
> #21 0x00722451 in ExecEvalFuncArgs (fcinfo=0x7ffd04d2bda0, 
> argList=0x324b378, econtext=0x32495d8) at execQual.c:1317
> #22 0x00722a68 in ExecMakeFunctionResult (fcache=0x3249850, 
> econtext=0x32495d8,
> isNull=0x7ffd04d2c5c1 "\306\322\004\375\177", isDone=0x0) at 
> execQual.c:1532
> #23 0x00723d1e in ExecEvalFunc (fcache=0x3249850, econtext=0x32495d8, 
> isNull=0x7ffd04d2c5c1 "\306\322\004\375\177",
> isDone=0x0) at 

[jira] [Updated] (HAWQ-1487) hang process due to deadlock when it try to process interrupt in error handling

2017-06-15 Thread Ruilong Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruilong Huo updated HAWQ-1487:
--
Fix Version/s: 2.3.0.0-incubating


[jira] [Updated] (HAWQ-1487) hang process due to deadlock when it try to process interrupt in error handling

2017-06-15 Thread Ruilong Huo (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruilong Huo updated HAWQ-1487:
--
Description: 
A process hangs when it tries to process an interrupt during error handling. 
Specifically, a QE (query executor) hits a division-by-zero error and starts 
to error out. While handling that error, it tries to process a query-cancel 
interrupt, and a deadlock occurs.
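
The call stack further down suggests why this deadlocks: errstart() was inside backtrace() (frames #12 to #17) when the cancel signal arrived, and the signal handler then called ProcessInterrupts(), errstart(), and backtrace() again (frames #0 to #10), waiting on a libgcc_s mutex that the interrupted code in the same process appears to hold. One common way to avoid this class of problem is to have the signal handler only record the request and act on it later at a safe point. The following is a minimal C sketch of that idea under stated assumptions; it is not the actual HAWQ fix, and every name in it (cancel_pending, holdoff_count, hold_interrupts, resume_interrupts, process_cancel_if_safe, demo_deferred_cancel) is hypothetical.

```c
#include <assert.h>
#include <signal.h>

/* All names here are hypothetical; this is not HAWQ or PostgreSQL source. */
static volatile sig_atomic_t cancel_pending = 0;    /* set by the signal handler */
static volatile sig_atomic_t holdoff_count = 0;     /* > 0 inside non-reentrant code */
static volatile sig_atomic_t cancels_processed = 0; /* bookkeeping for the demo */

/* The handler only records the request; it never logs or takes locks. */
static void cancel_handler(int signo)
{
    (void) signo;
    cancel_pending = 1;
}

static void hold_interrupts(void)
{
    holdoff_count++;
}

static void resume_interrupts(void)
{
    assert(holdoff_count > 0);
    holdoff_count--;
}

/* Called from normal (non-signal) context at points known to be safe. */
static void process_cancel_if_safe(void)
{
    if (cancel_pending && holdoff_count == 0)
    {
        cancel_pending = 0;
        cancels_processed++;    /* stand-in for the real cancel handling */
    }
}

/* Returns 1 if the cancel was deferred until the error path finished. */
int demo_deferred_cancel(void)
{
    int during;
    int after;

    cancel_pending = 0;
    holdoff_count = 0;
    cancels_processed = 0;
    signal(SIGINT, cancel_handler);

    hold_interrupts();          /* entering error reporting, e.g. backtrace() */
    raise(SIGINT);              /* a cancel arrives in the middle of it */
    process_cancel_if_safe();   /* no effect while interrupts are held off */
    during = cancels_processed;
    resume_interrupts();        /* error reporting is done */

    process_cancel_if_safe();   /* now the cancel is handled */
    after = cancels_processed;

    return during == 0 && after == 1;
}
```

In this sketch demo_deferred_cancel() returns 1 when the cancel request raised inside the held-off region is processed only after resume_interrupts(). The trade-off of this design is that a cancel is noticed only at the next safe point, so such checks have to be placed often enough to keep cancellation responsive.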

The hung process is:
{noformat}
$ hawq ssh -f hostfile -e "ps -ef | grep postgres | grep -v grep"
gpadmin   51246  51245  0 06:15 ?        00:00:01 postgres: port 20100, logger p
gpadmin   51249  51245  0 06:15 ?        00:00:00 postgres: port 20100, stats co
gpadmin   51250  51245  0 06:15 ?        00:00:07 postgres: port 20100, writer p
gpadmin   51251  51245  0 06:15 ?        00:00:01 postgres: port 20100, checkpoi
gpadmin   51252  51245  0 06:15 ?        00:00:11 postgres: port 20100, segment
gpadmin  182983  51245  0 07:00 ?        00:00:03 postgres: port 20100, hawqsupe

$ ps -ef | grep postgres | grep -v grep
gpadmin   51245      1  0 06:15 ?        00:01:01 /usr/local/hawq_2_2_0_0/bin/postgres -D /data/pulse-agent-data/HAWQ-main-FeatureTest-opt-Multinode-parallel/product/segmentdd -i -M segment -p 20100 --silent-mode=true
gpadmin   51246  51245  0 06:15 ?        00:00:01 postgres: port 20100, logger process
gpadmin   51249  51245  0 06:15 ?        00:00:00 postgres: port 20100, stats collector process
gpadmin   51250  51245  0 06:15 ?        00:00:07 postgres: port 20100, writer process
gpadmin   51251  51245  0 06:15 ?        00:00:01 postgres: port 20100, checkpoint process
gpadmin   51252  51245  0 06:15 ?        00:00:11 postgres: port 20100, segment resource manager
gpadmin  182983  51245  0 07:00 ?        00:00:03 postgres: port 20100, hawqsuperuser olap_winow... 10.32.34.225(45462) con4405 seg0 cmd2 slice7 MPPEXEC SELECT
gpadmin  194424 194402  0 23:50 pts/0    00:00:00 grep postgres
{noformat}

The call stack is:
{noformat}
$ sudo gdb -p 182983
(gdb) bt
#0  0x003ff060e2e4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x003ff0609588 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x003ff0609457 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x003ff221206a in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
#4  0x003ff220f603 in ?? () from /lib64/libgcc_s.so.1
#5  0x003ff220ff49 in ?? () from /lib64/libgcc_s.so.1
#6  0x003ff22100e7 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#7  0x003ff02fe966 in backtrace () from /lib64/libc.so.6
#8  0x009cda3f in errstart (elevel=20, filename=0xd309e0 "postgres.c", 
lineno=3618,
funcname=0xd32fc0 "ProcessInterrupts", domain=0x0) at elog.c:492
#9  0x008e8fcb in ProcessInterrupts () at postgres.c:3616
#10 0x008e8c9e in StatementCancelHandler (postgres_signal_arg=2) at 
postgres.c:3463
#11 <signal handler called>
#12 0x003ff0609451 in pthread_mutex_lock () from /lib64/libpthread.so.0
#13 0x003ff221206a in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
#14 0x003ff220f603 in ?? () from /lib64/libgcc_s.so.1
#15 0x003ff2210119 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#16 0x003ff02fe966 in backtrace () from /lib64/libc.so.6
#17 0x009cda3f in errstart (elevel=20, filename=0xd3ba00 "float.c", 
lineno=839, funcname=0xd3bf3a "float8div",
domain=0x0) at elog.c:492
#18 0x00921a84 in float8div (fcinfo=0x7ffd04d2b8b0) at float.c:836
#19 0x00722fe5 in ExecMakeFunctionResult (fcache=0x324a088, 
econtext=0x32495d8, isNull=0x7ffd04d2c0e0 "\030",
isDone=0x7ffd04d2bd04) at execQual.c:1762
#20 0x00723d87 in ExecEvalOper (fcache=0x324a088, econtext=0x32495d8, 
isNull=0x7ffd04d2c0e0 "\030",
isDone=0x7ffd04d2bd04) at execQual.c:2250
#21 0x00722451 in ExecEvalFuncArgs (fcinfo=0x7ffd04d2bda0, 
argList=0x324b378, econtext=0x32495d8) at execQual.c:1317
#22 0x00722a68 in ExecMakeFunctionResult (fcache=0x3249850, 
econtext=0x32495d8,
isNull=0x7ffd04d2c5c1 "\306\322\004\375\177", isDone=0x0) at execQual.c:1532
#23 0x00723d1e in ExecEvalFunc (fcache=0x3249850, econtext=0x32495d8, 
isNull=0x7ffd04d2c5c1 "\306\322\004\375\177",
isDone=0x0) at execQual.c:2228
#24 0x0076eed2 in initFcinfo (wrxstate=0x31b8fe0, 
fcinfo=0x7ffd04d2c280, funcstate=0x7f83c7412318, econtext=0x32495d8,
check_nulls=1 '\001') at nodeWindow.c:3201
#25 0x0076efa4 in add_tuple_to_trans (funcstate=0x7f83c7412318, 
wstate=0x3248ab8, econtext=0x32495d8,
check_nulls=1 '\001') at nodeWindow.c:3223
#26 0x00772f72 in processTupleSlot (wstate=0x3248ab8, slot=0x31ac150, 
last_peer=0 '\000') at nodeWindow.c:5105
#27 0x00772760 in ExecWindow (wstate=0x3248ab8) at nodeWindow.c:4821
---Type <return> to continue, or q <return> to quit---
#28 0x0071eda7 in ExecProcNode (node=0x3248ab8) at execProcnode.c:1007
#29 0x0075aded in NextInputSlot (node=0x31af928) at nodeResult.c:95