[GitHub] incubator-hawq pull request #1257: HAWQ-1487. Fix hang process due to deadlo...
GitHub user huor opened a pull request:

    https://github.com/apache/incubator-hawq/pull/1257

    HAWQ-1487. Fix hang process due to deadlock when it try to process interrupt in error handling

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/huor/incubator-hawq interrupt

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-hawq/pull/1257.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1257

commit 04773fca3790e705402a3e9a698aa65e5efb392d
Author: Ruilong Huo
Date:   2017-06-13T10:11:01Z

    HAWQ-1487. Fix hang process due to deadlock when it try to process interrupt in error handling

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
[jira] [Updated] (HAWQ-1487) hang process due to deadlock when it try to process interrupt in error handling
[ https://issues.apache.org/jira/browse/HAWQ-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruilong Huo updated HAWQ-1487:
------------------------------
    Affects Version/s: 2.2.0.0-incubating

> hang process due to deadlock when it try to process interrupt in error handling
> --------------------------------------------------------------------------------
>
>                 Key: HAWQ-1487
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1487
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: Query Execution
>    Affects Versions: 2.2.0.0-incubating
>            Reporter: Ruilong Huo
>            Assignee: Ruilong Huo
>             Fix For: 2.3.0.0-incubating
[jira] [Updated] (HAWQ-1487) hang process due to deadlock when it try to process interrupt in error handling
[ https://issues.apache.org/jira/browse/HAWQ-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruilong Huo updated HAWQ-1487:
------------------------------
    Fix Version/s: 2.3.0.0-incubating

> hang process due to deadlock when it try to process interrupt in error handling
> --------------------------------------------------------------------------------
>
>                 Key: HAWQ-1487
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1487
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: Query Execution
>            Reporter: Ruilong Huo
>            Assignee: Ruilong Huo
>             Fix For: 2.3.0.0-incubating
[jira] [Assigned] (HAWQ-1487) hang process due to deadlock when it try to process interrupt in error handling
[ https://issues.apache.org/jira/browse/HAWQ-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruilong Huo reassigned HAWQ-1487:
---------------------------------
    Assignee: Ruilong Huo  (was: Lei Chang)

> hang process due to deadlock when it try to process interrupt in error handling
> --------------------------------------------------------------------------------
>
>                 Key: HAWQ-1487
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1487
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: Query Execution
>            Reporter: Ruilong Huo
>            Assignee: Ruilong Huo
>             Fix For: 2.3.0.0-incubating
[jira] [Updated] (HAWQ-1487) hang process due to deadlock when it try to process interrupt in error handling
[ https://issues.apache.org/jira/browse/HAWQ-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruilong Huo updated HAWQ-1487:
------------------------------
    Description:
A process hangs when it tries to process an interrupt during error handling. Specifically, a QE encounters a division-by-zero error and errors out. While the error is being processed, the QE tries to handle a query-cancel interrupt, and a deadlock occurs.

The hung process is:
{noformat}
$ hawq ssh -f hostfile -e "ps -ef | grep postgres | grep -v grep"
gpadmin   51246  51245  0 06:15 ?        00:00:01 postgres: port 20100, logger p
gpadmin   51249  51245  0 06:15 ?        00:00:00 postgres: port 20100, stats co
gpadmin   51250  51245  0 06:15 ?        00:00:07 postgres: port 20100, writer p
gpadmin   51251  51245  0 06:15 ?        00:00:01 postgres: port 20100, checkpoi
gpadmin   51252  51245  0 06:15 ?        00:00:11 postgres: port 20100, segment
gpadmin  182983  51245  0 07:00 ?        00:00:03 postgres: port 20100, hawqsupe
$ ps -ef | grep postgres | grep -v grep
gpadmin   51245      1  0 06:15 ?        00:01:01 /usr/local/hawq_2_2_0_0/bin/postgres -D /data/pulse-agent-data/HAWQ-main-FeatureTest-opt-Multinode-parallel/product/segmentdd -i -M segment -p 20100 --silent-mode=true
gpadmin   51246  51245  0 06:15 ?        00:00:01 postgres: port 20100, logger process
gpadmin   51249  51245  0 06:15 ?        00:00:00 postgres: port 20100, stats collector process
gpadmin   51250  51245  0 06:15 ?        00:00:07 postgres: port 20100, writer process
gpadmin   51251  51245  0 06:15 ?        00:00:01 postgres: port 20100, checkpoint process
gpadmin   51252  51245  0 06:15 ?        00:00:11 postgres: port 20100, segment resource manager
gpadmin  182983  51245  0 07:00 ?        00:00:03 postgres: port 20100, hawqsuperuser olap_winow... 10.32.34.225(45462) con4405 seg0 cmd2 slice7 MPPEXEC SELECT
gpadmin  194424 194402  0 23:50 pts/0    00:00:00 grep postgres
{noformat}

The call stack is:
{noformat}
$ sudo gdb -p 182983
(gdb) bt
#0  0x003ff060e2e4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x003ff0609588 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x003ff0609457 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x003ff221206a in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
#4  0x003ff220f603 in ?? () from /lib64/libgcc_s.so.1
#5  0x003ff220ff49 in ?? () from /lib64/libgcc_s.so.1
#6  0x003ff22100e7 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#7  0x003ff02fe966 in backtrace () from /lib64/libc.so.6
#8  0x009cda3f in errstart (elevel=20, filename=0xd309e0 "postgres.c", lineno=3618,
    funcname=0xd32fc0 "ProcessInterrupts", domain=0x0) at elog.c:492
#9  0x008e8fcb in ProcessInterrupts () at postgres.c:3616
#10 0x008e8c9e in StatementCancelHandler (postgres_signal_arg=2) at postgres.c:3463
#11 <signal handler called>
#12 0x003ff0609451 in pthread_mutex_lock () from /lib64/libpthread.so.0
#13 0x003ff221206a in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
#14 0x003ff220f603 in ?? () from /lib64/libgcc_s.so.1
#15 0x003ff2210119 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#16 0x003ff02fe966 in backtrace () from /lib64/libc.so.6
#17 0x009cda3f in errstart (elevel=20, filename=0xd3ba00 "float.c", lineno=839, funcname=0xd3bf3a "float8div",
    domain=0x0) at elog.c:492
#18 0x00921a84 in float8div (fcinfo=0x7ffd04d2b8b0) at float.c:836
#19 0x00722fe5 in ExecMakeFunctionResult (fcache=0x324a088, econtext=0x32495d8, isNull=0x7ffd04d2c0e0 "\030",
    isDone=0x7ffd04d2bd04) at execQual.c:1762
#20 0x00723d87 in ExecEvalOper (fcache=0x324a088, econtext=0x32495d8, isNull=0x7ffd04d2c0e0 "\030",
    isDone=0x7ffd04d2bd04) at execQual.c:2250
#21 0x00722451 in ExecEvalFuncArgs (fcinfo=0x7ffd04d2bda0, argList=0x324b378, econtext=0x32495d8) at execQual.c:1317
#22 0x00722a68 in ExecMakeFunctionResult (fcache=0x3249850, econtext=0x32495d8,
    isNull=0x7ffd04d2c5c1 "\306\322\004\375\177", isDone=0x0) at execQual.c:1532
#23 0x00723d1e in ExecEvalFunc (fcache=0x3249850, econtext=0x32495d8, isNull=0x7ffd04d2c5c1 "\306\322\004\375\177",
    isDone=0x0) at execQual.c:2228
#24 0x0076eed2 in initFcinfo (wrxstate=0x31b8fe0, fcinfo=0x7ffd04d2c280, funcstate=0x7f83c7412318,
    econtext=0x32495d8, check_nulls=1 '\001') at nodeWindow.c:3201
#25 0x0076efa4 in add_tuple_to_trans (funcstate=0x7f83c7412318, wstate=0x3248ab8, econtext=0x32495d8,
    check_nulls=1 '\001') at nodeWindow.c:3223
#26 0x00772f72 in processTupleSlot (wstate=0x3248ab8, slot=0x31ac150, last_peer=0 '\000') at nodeWindow.c:5105
#27 0x00772760 in ExecWindow (wstate=0x3248ab8) at nodeWindow.c:4821
#28 0x0071eda7 in ExecProcNode (node=0x3248ab8) at execProcnode.c:1007
#29 0x0075aded in NextInputSlot (node=0x31af928) at nodeResult.c:95
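
In the call stack above, backtrace() is interrupted inside libgcc's unwinder (frames #12 to #16), and StatementCancelHandler (frame #10) then reaches backtrace() again (frame #7) and waits on a mutex in _Unwind_Find_FDE that the interrupted call appears to be holding. A common way to avoid this class of hang is to let the signal handler only record the request and to act on it later, once no non-reentrant section is active. The following is a minimal sketch of that general pattern, not the patch from pull request #1257; every name in it (cancel_pending, holdoff_count, check_for_cancel, run_demo) is hypothetical and does not come from the HAWQ source.

```c
#include <signal.h>

/* Hypothetical names for illustration; none of these come from HAWQ. */
static volatile sig_atomic_t cancel_pending = 0; /* set by the handler */
static volatile sig_atomic_t holdoff_count = 0;  /* > 0: not safe to act now */
static int cancels_processed = 0;

/* The handler does only async-signal-safe work: it records the request. */
static void cancel_handler(int signo)
{
    (void) signo;
    cancel_pending = 1;
}

/* Called from normal code at points where taking locks is safe. */
static void check_for_cancel(void)
{
    if (cancel_pending && holdoff_count == 0)
    {
        cancel_pending = 0;
        cancels_processed++;   /* real code would report the cancel here */
    }
}

/*
 * Simulates a cancel signal arriving in the middle of a non-reentrant
 * section (for example, while a backtrace is being collected).
 * Returns 1 if the cancel was deferred until the section ended.
 */
int run_demo(void)
{
    int processed_inside;

    cancel_pending = 0;
    holdoff_count = 0;
    cancels_processed = 0;
    signal(SIGINT, cancel_handler);

    holdoff_count++;           /* enter the non-reentrant section */
    raise(SIGINT);             /* the cancel arrives mid-section */
    check_for_cancel();        /* deferred: holdoff_count is still > 0 */
    processed_inside = cancels_processed;
    holdoff_count--;           /* leave the non-reentrant section */

    check_for_cancel();        /* the pending cancel is handled now */
    return processed_inside == 0 && cancels_processed == 1;
}
```

Calling run_demo() from a small main() is enough to try it. The design choice is that the handler never calls anything that may take a lock; the cost is that a cancel is only noticed at the next explicit check.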
[jira] [Created] (HAWQ-1487) hang process due to deadlock when it try to process interrupt in error handling
Ruilong Huo created HAWQ-1487:
---------------------------------

             Summary: hang process due to deadlock when it try to process interrupt in error handling
                 Key: HAWQ-1487
                 URL: https://issues.apache.org/jira/browse/HAWQ-1487
             Project: Apache HAWQ
          Issue Type: Bug
          Components: Query Execution
            Reporter: Ruilong Huo
            Assignee: Lei Chang
[jira] [Closed] (HAWQ-1485) Use user/password instead of credentials cache in Ranger lookup for HAWQ with Kerberos enabled.
[ https://issues.apache.org/jira/browse/HAWQ-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hongxu Ma closed HAWQ-1485.
---------------------------
    Resolution: Fixed

fixed

> Use user/password instead of credentials cache in Ranger lookup for HAWQ with Kerberos enabled.
> -----------------------------------------------------------------------------------------------
>
>                 Key: HAWQ-1485
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1485
>             Project: Apache HAWQ
>          Issue Type: Sub-task
>          Components: Security
>            Reporter: Hongxu Ma
>            Assignee: Hongxu Ma
>             Fix For: 2.3.0.0-incubating
>
>
> When the credentials cache is used, entering a wrong password in the Ranger UI
> does not destroy the existing Kerberos credentials (created by the last
> successful kinit command). This behavior is confusing to users,
> so we should use user/password for Kerberos authentication.
> Core logic:
> {code}
> Properties props = new Properties();
> if (connectionProperties.containsKey(AUTHENTICATION) &&
>         connectionProperties.get(AUTHENTICATION).equals(KERBEROS)) {
>     // kerberos mode
>     props.setProperty("kerberosServerName", connectionProperties.get("principal"));
>     props.setProperty("jaasApplicationName", "pgjdbc");
> }
> String url = String.format("jdbc:postgresql://%s:%s/%s",
>         connectionProperties.get("hostname"), connectionProperties.get("port"), db);
> props.setProperty("user", connectionProperties.get("username"));
> props.setProperty("password", connectionProperties.get("password"));
> return DriverManager.getConnection(url, props);
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (HAWQ-1484) Spin PXF into a Separate Project for Data Access
[ https://issues.apache.org/jira/browse/HAWQ-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051065#comment-16051065 ]

Shivram Mani commented on HAWQ-1484:
------------------------------------

[~sirinath] when you say separate project, do you mean a separate top-level project with its own repository?

> Spin PXF into a Separate Project for Data Access
> ------------------------------------------------
>
>                 Key: HAWQ-1484
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1484
>             Project: Apache HAWQ
>          Issue Type: New Feature
>            Reporter: Suminda Dharmasena
>            Assignee: Radar Lei
>
> Can PXF be spun off into a separate project, so that it can be used as a
> basis for other data access projects?
[jira] [Resolved] (HAWQ-1486) PANIC accessing PXF HDFS table
[ https://issues.apache.org/jira/browse/HAWQ-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oleksandr Diachenko resolved HAWQ-1486.
---------------------------------------
    Resolution: Fixed

Merged to master.

> PANIC accessing PXF HDFS table
> ------------------------------
>
>                 Key: HAWQ-1486
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1486
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: External Tables, PXF
>            Reporter: John Gaskin
>            Assignee: Vineet Goel
>             Fix For: 2.3.0.0-incubating
>
>
> This code doesn't catch the case when churl_init_download() returns NULL.
> This seems to trigger a segfault at libcurl level.
> {code}
> Looks like we failed to connect to PXF (?).
> Piece of code in HAWQ handling cUrl calls (pxfutils.c):
> static void process_request(ClientContext* client_context, char *uri)
> {
>     size_t n = 0;
>     char buffer[RAW_BUF_SIZE];
>
>     print_http_headers(client_context->http_headers);
>     client_context->handle = churl_init_download(uri, client_context->http_headers);
>     memset(buffer, 0, RAW_BUF_SIZE);
>     resetStringInfo(&(client_context->the_rest_buf));
>
>     /*
>      * This try-catch ensures that in case of an exception during the "communication with PXF and the accumulation of
>      * PXF data in client_context->the_rest_buf", we still get to terminate the libcurl connection nicely and avoid
>      * leaving the PXF server connection hung.
>      */
>     PG_TRY();
>     {
>         /* read some bytes to make sure the connection is established */
>         churl_read_check_connectivity(client_context->handle);
> {code}
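
The report above says the handle returned by churl_init_download() is used without a NULL check. As an illustration of the kind of guard being asked for, here is a minimal, self-contained sketch. It does not use the real churl API: demo_handle, demo_init_download and demo_process_request are hypothetical stand-ins, and the actual fix merged for HAWQ-1486 may look quite different.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins; these are not the churl_* functions from HAWQ. */
typedef struct
{
    const char *uri;
} demo_handle;

static demo_handle the_handle;

/* Returns NULL when the connection cannot be set up (here: missing URI). */
static demo_handle *demo_init_download(const char *uri)
{
    if (uri == NULL || uri[0] == '\0')
        return NULL;
    the_handle.uri = uri;
    return &the_handle;
}

/* Returns 0 on success, -1 if the handle could not be created. */
int demo_process_request(const char *uri)
{
    demo_handle *handle = demo_init_download(uri);

    if (handle == NULL)
    {
        /* Fail early instead of passing NULL to later calls. */
        fprintf(stderr, "failed to initialize download for \"%s\"\n",
                uri ? uri : "(null)");
        return -1;
    }

    /* From here on it is safe to dereference the handle. */
    return 0;
}
```

In a PostgreSQL-style backend the early return would more likely be an error report raised before the try block is entered; that detail is an assumption and is left out of the sketch.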
[GitHub] incubator-hawq issue #1255: HAWQ-1486. Catch error out on NULL condition for...
Github user sansanichfb commented on the issue:

    https://github.com/apache/incubator-hawq/pull/1255

    Merged code to master, feel free to close PR.
[GitHub] incubator-hawq pull request #1256: HAWQ-1485. Fix exception of decryptPasswo...
Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-hawq/pull/1256
[GitHub] incubator-hawq issue #1256: HAWQ-1485. Fix exception of decryptPassword twic...
Github user linwen commented on the issue:

    https://github.com/apache/incubator-hawq/pull/1256

    LGTM