[jira] [Commented] (HAWQ-564) QD hangs when connecting to resource manager
[ https://issues.apache.org/jira/browse/HAWQ-564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206095#comment-15206095 ]

Chunling Wang commented on HAWQ-564:

And 'kill -6' can cause the same result.

> QD hangs when connecting to resource manager
>
>                 Key: HAWQ-564
>                 URL: https://issues.apache.org/jira/browse/HAWQ-564
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: Resource Manager
>    Affects Versions: 2.0.0
>            Reporter: Chunling Wang
>            Assignee: Lei Chang
>
> We first inject a panic into a QE process and run a query, which brings the
> segment down. After the segment comes back up, we run another query and get
> the correct answer. We then inject the same panic a second time. After the
> segment goes down and comes back up again, we run a query and find that the
> QD process hangs when connecting to the resource manager. Here is the
> backtrace when the QD hangs:
> {code}
> * thread #1: tid = 0x21d8be, 0x7fff890355be libsystem_kernel.dylib`poll + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
>   * frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10
>     frame #1: 0x000101daeafe postgres`processAllCommFileDescs + 158 at rmcomm_AsyncComm.c:156
>     frame #2: 0x000101db85f5 postgres`callSyncRPCRemote(hostname=0x7f9c19e00cd0, port=5437, sendbuff=0x7f9c1b918f50, sendbuffsize=80, sendmsgid=259, exprecvmsgid=2307, recvsmb=, errorbuf=0x00010230c1a0, errorbufsize=) + 645 at rmcomm_SyncComm.c:122
>     frame #3: 0x000101db2d85 postgres`acquireResourceFromRM [inlined] callSyncRPCToRM(sendbuff=0x7f9c1b918f50, sendbuffsize=, sendmsgid=259, exprecvmsgid=2307, recvsmb=0x7f9c1b918e70, errorbuf=, errorbufsize=1024) + 73 at rmcomm_QD2RM.c:2780
>     frame #4: 0x000101db2d3c postgres`acquireResourceFromRM(index=, sessionid=12, slice_size=462524016, iobytes=134217728, preferred_nodes=0x7f9c1a02d398, preferred_nodes_size=, max_seg_count_fix=, min_seg_count_fix=, errorbuf=, errorbufsize=) + 572 at rmcomm_QD2RM.c:742
>     frame #5: 0x000101c979e7 postgres`AllocateResource(life=QRL_ONCE, slice_size=5, iobytes=134217728, max_target_segment_num=1, min_target_segment_num=1, vol_info=0x7f9c1a02d398, vol_info_size=1) + 631 at pquery.c:796
>     frame #6: 0x000101e8c60f postgres`calculate_planner_segment_num(query=, resourceLife=QRL_ONCE, fullRangeTable=, intoPolicy=, sliceNum=5) + 14287 at cdbdatalocality.c:4207
>     frame #7: 0x000101c0f671 postgres`planner + 106 at planner.c:496
>     frame #8: 0x000101c0f607 postgres`planner(parse=0x7f9c1a02a140, cursorOptions=, boundParams=0x, resourceLife=QRL_ONCE) + 311 at planner.c:310
>     frame #9: 0x000101c8eb33 postgres`pg_plan_query(querytree=0x7f9c1a02a140, boundParams=0x, resource_life=QRL_ONCE) + 99 at postgres.c:837
>     frame #10: 0x000101c956ae postgres`exec_simple_query + 21 at postgres.c:911
>     frame #11: 0x000101c95699 postgres`exec_simple_query(query_string=0x7f9c1a028a30, seqServerHost=0x, seqServerPort=-1) + 1577 at postgres.c:1671
>     frame #12: 0x000101c91a4c postgres`PostgresMain(argc=, argv=, username=0x7f9c1b808cf0) + 9404 at postgres.c:4754
>     frame #13: 0x000101c4ae02 postgres`ServerLoop [inlined] BackendRun + 105 at postmaster.c:5889
>     frame #14: 0x000101c4ad99 postgres`ServerLoop at postmaster.c:5484
>     frame #15: 0x000101c4ad99 postgres`ServerLoop + 9593 at postmaster.c:2163
>     frame #16: 0x000101c47d3b postgres`PostmasterMain(argc=, argv=) + 5019 at postmaster.c:1454
>     frame #17: 0x000101bb1aa9 postgres`main(argc=9, argv=0x7f9c19c1eef0) + 1433 at main.c:209
>     frame #18: 0x7fff95e8c5c9 libdyld.dylib`start + 1
>   thread #2: tid = 0x21d8bf, 0x7fff890355be libsystem_kernel.dylib`poll + 10
>     frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10
>     frame #1: 0x000101dfe723 postgres`rxThreadFunc(arg=) + 2163 at ic_udp.c:6251
>     frame #2: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
>     frame #3: 0x7fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
>     frame #4: 0x7fff95e804b1 libsystem_pthread.dylib`thread_start + 13
>   thread #3: tid = 0x21d9c2, 0x7fff890343f6 libsystem_kernel.dylib`__select + 10
>     frame #0: 0x7fff890343f6 libsystem_kernel.dylib`__select + 10
>     frame #1: 0x000101e9d42e postgres`pg_usleep(microsec=) + 78 at pgsleep.c:43
>     frame #2: 0x000101db1a66 postgres`generateResourceRefreshHeartBeat(arg=0x7f9c19f02480) + 166 at rmcomm_QD2RM.c:1519
>     frame #3: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
>     frame #4: 0x7fff95e82279 lib
[jira] [Created] (HAWQ-572) Improve code coverage for dispatcher: fail_qe_after_connection & fail_qe_when_do_query & fail_qe_when_begin_parquet_scan
Chunling Wang created HAWQ-572:

             Summary: Improve code coverage for dispatcher: fail_qe_after_connection & fail_qe_when_do_query & fail_qe_when_begin_parquet_scan
                 Key: HAWQ-572
                 URL: https://issues.apache.org/jira/browse/HAWQ-572
             Project: Apache HAWQ
          Issue Type: Sub-task
          Components: Dispatcher
            Reporter: Chunling Wang
            Assignee: Lei Chang

Add the following fault injections:
1. fail_qe_after_connection
2. fail_qe_when_do_query
3. fail_qe_when_begin_parquet_scan

Also add test cases for them.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
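
Note: as a rough illustration of what named fault injections like the three listed in HAWQ-572 involve, here is a minimal Python sketch. It is not HAWQ's fault injector. The FaultInjector class, the FaultInjected exception, and the run_query function are hypothetical names made up for this note; only the three fault names come from the issue, and the error text imitates the "fault triggered, fault name:" message quoted in HAWQ-564.

```python
class FaultInjected(RuntimeError):
    """Raised when execution reaches an armed fault point (hypothetical)."""


class FaultInjector:
    """Tiny registry of named fault points. Illustration only, not HAWQ's API."""

    def __init__(self):
        self._armed = set()

    def arm(self, name):
        # After arming, the next check(name) raises.
        self._armed.add(name)

    def disarm(self, name):
        self._armed.discard(name)

    def check(self, name):
        # Call this at the code location that the fault name describes.
        if name in self._armed:
            raise FaultInjected(f"fault triggered, fault name:'{name}'")


def run_query(injector):
    """Stand-in for a QE handling one query; each stage is a fault point."""
    injector.check("fail_qe_after_connection")
    injector.check("fail_qe_when_do_query")
    injector.check("fail_qe_when_begin_parquet_scan")
    return "ok"
```

A test case in this style arms one fault, runs the query, checks that the expected error surfaces, then disarms the fault and checks that the next query succeeds.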
[jira] [Updated] (HAWQ-568) After query finished, kill a QE but can still recv() data from this QE socket
[ https://issues.apache.org/jira/browse/HAWQ-568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chunling Wang updated HAWQ-568:

    Summary: After query finished, kill a QE but can still recv() data from this QE socket  (was: After query finished, kill a QE but can still recv() from this QE socket)

> After query finished, kill a QE but can still recv() data from this QE socket
>
>                 Key: HAWQ-568
>                 URL: https://issues.apache.org/jira/browse/HAWQ-568
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: Dispatcher
>    Affects Versions: 2.0.0
>            Reporter: Chunling Wang
>            Assignee: Lei Chang
>
> After a query finishes, we kill one QE while the other QEs remain in the QE
> pool. When we check whether the connection to the killed QE is still alive
> by calling recv() on its socket, we can still receive data.
> 1. Run a query so that some QEs remain in the pool.
> {code}
> dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
>  count
> -------
>   3725
> (1 row)
> {code}
> {code}
> $ ps -ef|grep postgres
> 501 55701 1 0 5:38PM ?? 0:00.38 /usr/local/hawq/bin/postgres -D /Users/wangchunling/hawq-data-directory/masterdd -i -M master -p 5432 --silent-mode=true
> 501 55702 55701 0 5:38PM ?? 0:00.01 postgres: port 5432, master logger process
> 501 55705 55701 0 5:38PM ?? 0:00.00 postgres: port 5432, stats collector process
> 501 55706 55701 0 5:38PM ?? 0:00.04 postgres: port 5432, writer process
> 501 55707 55701 0 5:38PM ?? 0:00.01 postgres: port 5432, checkpoint process
> 501 55708 55701 0 5:38PM ?? 0:00.00 postgres: port 5432, seqserver process
> 501 55709 55701 0 5:38PM ?? 0:00.01 postgres: port 5432, WAL Send Server process
> 501 55710 55701 0 5:38PM ?? 0:00.00 postgres: port 5432, DFS Metadata Cache process
> 501 55711 55701 0 5:38PM ?? 0:00.26 postgres: port 5432, master resource manager
> 501 55727 1 0 5:38PM ?? 0:00.52 /usr/local/hawq/bin/postgres -D /Users/wangchunling/hawq-data-directory/segmentdd -i -M segment -p 4 --silent-mode=true
> 501 55728 55727 0 5:38PM ?? 0:00.06 postgres: port 4, logger process
> 501 55731 55727 0 5:38PM ?? 0:00.00 postgres: port 4, stats collector process
> 501 55732 55727 0 5:38PM ?? 0:00.04 postgres: port 4, writer process
> 501 55733 55727 0 5:38PM ?? 0:00.01 postgres: port 4, checkpoint process
> 501 55734 55727 0 5:38PM ?? 0:00.09 postgres: port 4, segment resource manager
> 501 55741 55748 0 5:38PM ?? 0:00.05 postgres: port 5432, wangchunling dispatch [local] con12 cmd6 idle [local]
> 501 55743 55727 0 5:38PM ?? 0:00.36 postgres: port 4, wangchunling dispatch 127.0.0.1(50800) con12 seg0 idle
> 501 55770 55727 0 5:43PM ?? 0:00.12 postgres: port 4, wangchunling dispatch 127.0.0.1(50853) con12 seg0 idle
> 501 55771 55727 0 5:44PM ?? 0:00.11 postgres: port 4, wangchunling dispatch 127.0.0.1(50855) con12 seg0 idle
> 501 55774 26980 0 5:44PM ttys008 0:00.00 grep postgres
> {code}
> 2. Kill one QE.
> {code}
> $ kill 55771
> $ ps -ef|grep postgres
> 501 55701 1 0 5:38PM ?? 0:00.38 /usr/local/hawq/bin/postgres -D /Users/wangchunling/hawq-data-directory/masterdd -i -M master -p 5432 --silent-mode=true
> 501 55702 55701 0 5:38PM ?? 0:00.01 postgres: port 5432, master logger process
> 501 55705 55701 0 5:38PM ?? 0:00.00 postgres: port 5432, stats collector process
> 501 55706 55701 0 5:38PM ?? 0:00.04 postgres: port 5432, writer process
> 501 55707 55701 0 5:38PM ?? 0:00.01 postgres: port 5432, checkpoint process
> 501 55708 55701 0 5:38PM ?? 0:00.00 postgres: port 5432, seqserver process
> 501 55709 55701 0 5:38PM ?? 0:00.01 postgres: port 5432, WAL Send Server process
> 501 55710 55701 0 5:38PM ?? 0:00.00 postgres: port 5432, DFS Metadata Cache process
> 501 55711 55701 0 5:38PM ?? 0:00.27 postgres: port 5432, master resource manager
> 501 55727 1 0 5:38PM ?? 0:00.52 /usr/local/hawq/bin/postgres -D /Users/wangchunling/hawq-data-directory/segmentdd -i -M segment -p 4 --silent-mode=true
> 501 55728 55727 0 5:38PM ?? 0:00.06 postgres: port 4, logger process
> 501 55731 55727 0 5:38PM ?? 0:00.00 postgres: port 4, stats collector process
> 501 55732 55727 0 5:38PM ?? 0:00.04 postgres: port 4, writer process
> 501
[jira] [Created] (HAWQ-568) After query finished, kill a QE but can still recv() from this QE socket
Chunling Wang created HAWQ-568:

             Summary: After query finished, kill a QE but can still recv() from this QE socket
                 Key: HAWQ-568
                 URL: https://issues.apache.org/jira/browse/HAWQ-568
             Project: Apache HAWQ
          Issue Type: Bug
          Components: Dispatcher
            Reporter: Chunling Wang
            Assignee: Lei Chang

After a query finishes, we kill one QE while the other QEs remain in the QE pool. When we check whether the connection to the killed QE is still alive by calling recv() on its socket, we can still receive data.
1. Run a query so that some QEs remain in the pool.
{code}
dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
 count
-------
  3725
(1 row)
{code}
{code}
$ ps -ef|grep postgres
501 55701 1 0 5:38PM ?? 0:00.38 /usr/local/hawq/bin/postgres -D /Users/wangchunling/hawq-data-directory/masterdd -i -M master -p 5432 --silent-mode=true
501 55702 55701 0 5:38PM ?? 0:00.01 postgres: port 5432, master logger process
501 55705 55701 0 5:38PM ?? 0:00.00 postgres: port 5432, stats collector process
501 55706 55701 0 5:38PM ?? 0:00.04 postgres: port 5432, writer process
501 55707 55701 0 5:38PM ?? 0:00.01 postgres: port 5432, checkpoint process
501 55708 55701 0 5:38PM ?? 0:00.00 postgres: port 5432, seqserver process
501 55709 55701 0 5:38PM ?? 0:00.01 postgres: port 5432, WAL Send Server process
501 55710 55701 0 5:38PM ?? 0:00.00 postgres: port 5432, DFS Metadata Cache process
501 55711 55701 0 5:38PM ?? 0:00.26 postgres: port 5432, master resource manager
501 55727 1 0 5:38PM ?? 0:00.52 /usr/local/hawq/bin/postgres -D /Users/wangchunling/hawq-data-directory/segmentdd -i -M segment -p 4 --silent-mode=true
501 55728 55727 0 5:38PM ?? 0:00.06 postgres: port 4, logger process
501 55731 55727 0 5:38PM ?? 0:00.00 postgres: port 4, stats collector process
501 55732 55727 0 5:38PM ?? 0:00.04 postgres: port 4, writer process
501 55733 55727 0 5:38PM ?? 0:00.01 postgres: port 4, checkpoint process
501 55734 55727 0 5:38PM ?? 0:00.09 postgres: port 4, segment resource manager
501 55741 55748 0 5:38PM ?? 0:00.05 postgres: port 5432, wangchunling dispatch [local] con12 cmd6 idle [local]
501 55743 55727 0 5:38PM ?? 0:00.36 postgres: port 4, wangchunling dispatch 127.0.0.1(50800) con12 seg0 idle
501 55770 55727 0 5:43PM ?? 0:00.12 postgres: port 4, wangchunling dispatch 127.0.0.1(50853) con12 seg0 idle
501 55771 55727 0 5:44PM ?? 0:00.11 postgres: port 4, wangchunling dispatch 127.0.0.1(50855) con12 seg0 idle
501 55774 26980 0 5:44PM ttys008 0:00.00 grep postgres
{code}
2. Kill one QE.
{code}
$ kill 55771
$ ps -ef|grep postgres
501 55701 1 0 5:38PM ?? 0:00.38 /usr/local/hawq/bin/postgres -D /Users/wangchunling/hawq-data-directory/masterdd -i -M master -p 5432 --silent-mode=true
501 55702 55701 0 5:38PM ?? 0:00.01 postgres: port 5432, master logger process
501 55705 55701 0 5:38PM ?? 0:00.00 postgres: port 5432, stats collector process
501 55706 55701 0 5:38PM ?? 0:00.04 postgres: port 5432, writer process
501 55707 55701 0 5:38PM ?? 0:00.01 postgres: port 5432, checkpoint process
501 55708 55701 0 5:38PM ?? 0:00.00 postgres: port 5432, seqserver process
501 55709 55701 0 5:38PM ?? 0:00.01 postgres: port 5432, WAL Send Server process
501 55710 55701 0 5:38PM ?? 0:00.00 postgres: port 5432, DFS Metadata Cache process
501 55711 55701 0 5:38PM ?? 0:00.27 postgres: port 5432, master resource manager
501 55727 1 0 5:38PM ?? 0:00.52 /usr/local/hawq/bin/postgres -D /Users/wangchunling/hawq-data-directory/segmentdd -i -M segment -p 4 --silent-mode=true
501 55728 55727 0 5:38PM ?? 0:00.06 postgres: port 4, logger process
501 55731 55727 0 5:38PM ?? 0:00.00 postgres: port 4, stats collector process
501 55732 55727 0 5:38PM ?? 0:00.04 postgres: port 4, writer process
501 55733 55727 0 5:38PM ?? 0:00.01 postgres: port 4, checkpoint process
501 55734 55727 0 5:38PM ?? 0:00.09 postgres: port 4, segment resource manager
501 55741 55748 0 5:38PM ?? 0:00.05 postgres: port 5432, wangchunling dispatch [local] con12 cmd6 idle [local]
501 55743 55727 0 5:38PM ?? 0:00.36 postgres: port 4, wangchunling dispatch 127.0.0.1(50800) con12 seg0 idle
501 55770 55727 0 5:43PM ?? 0:00.12 postgres: port 4, wangchunling dispatch 127.0.0.1(50853) con12 seg0 idle
501 55776 269
[jira] [Updated] (HAWQ-568) After query finished, kill a QE but can still recv() from this QE socket
[ https://issues.apache.org/jira/browse/HAWQ-568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chunling Wang updated HAWQ-568:

    Affects Version/s: 2.0.0

> After query finished, kill a QE but can still recv() from this QE socket
>
>                 Key: HAWQ-568
>                 URL: https://issues.apache.org/jira/browse/HAWQ-568
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: Dispatcher
>    Affects Versions: 2.0.0
>            Reporter: Chunling Wang
>            Assignee: Lei Chang
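
Note: one general property of stream sockets may be relevant to the HAWQ-568 report. Bytes that a peer wrote before it exited stay readable on the other end, so a successful recv() does not by itself prove that the peer is alive; only a zero-length read says the peer has closed. Whether this is what happens with the killed QE is an assumption, since the report does not say what data was received. The Python sketch below shows the behaviour on a local socket pair; peer_state is a hypothetical helper, not a HAWQ function.

```python
import socket


def peer_state(sock):
    """Classify a connected stream socket without consuming any data.

    Returns "data" if bytes are still buffered, "closed" if the peer has
    closed and nothing is left to read, and "idle" if nothing is pending.
    """
    sock.setblocking(False)
    try:
        # MSG_PEEK looks at the receive buffer but leaves its contents in place.
        chunk = sock.recv(1, socket.MSG_PEEK)
    except BlockingIOError:
        # Nothing buffered and no EOF yet: the connection looks open.
        return "idle"
    finally:
        sock.setblocking(True)
    # An empty result means end-of-stream: the peer closed its side.
    return "data" if chunk else "closed"
```

With this helper, a peer that sends a few bytes and then exits is first reported as "data" and only as "closed" once those bytes have been read, so a liveness check needs to treat "data" as inconclusive rather than as proof of a live peer.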
[jira] [Commented] (HAWQ-564) QD hangs when connecting to resource manager
[ https://issues.apache.org/jira/browse/HAWQ-564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203864#comment-15203864 ]

Chunling Wang commented on HAWQ-564:

There is another way to cause this bug without fault injection.
1. First run a query to get some QEs.
{code}
dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
 count
-------
  3725
(1 row)
{code}
{code}
$ ps -ef|grep postgres
501 30190 1 0 2:34PM ?? 0:00.31 /usr/local/hawq/bin/postgres -D /Users/wangchunling/hawq-data-directory/masterdd -i -M master -p 5432 --silent-mode=true
501 30191 30190 0 2:34PM ?? 0:00.01 postgres: port 5432, master logger process
501 30194 30190 0 2:34PM ?? 0:00.00 postgres: port 5432, stats collector process
501 30195 30190 0 2:34PM ?? 0:00.01 postgres: port 5432, writer process
501 30196 30190 0 2:34PM ?? 0:00.00 postgres: port 5432, checkpoint process
501 30197 30190 0 2:34PM ?? 0:00.00 postgres: port 5432, seqserver process
501 30198 30190 0 2:34PM ?? 0:00.00 postgres: port 5432, WAL Send Server process
501 30199 30190 0 2:34PM ?? 0:00.00 postgres: port 5432, DFS Metadata Cache process
501 30200 30190 0 2:34PM ?? 0:00.07 postgres: port 5432, master resource manager
501 30216 1 0 2:34PM ?? 0:00.37 /usr/local/hawq/bin/postgres -D /Users/wangchunling/hawq-data-directory/segmentdd -i -M segment -p 4 --silent-mode=true
501 30217 30216 0 2:34PM ?? 0:00.02 postgres: port 4, logger process
501 30220 30216 0 2:34PM ?? 0:00.00 postgres: port 4, stats collector process
501 30221 30216 0 2:34PM ?? 0:00.01 postgres: port 4, writer process
501 30222 30216 0 2:34PM ?? 0:00.00 postgres: port 4, checkpoint process
501 30223 30216 0 2:34PM ?? 0:00.03 postgres: port 4, segment resource manager
501 30231 30190 0 2:35PM ?? 0:00.03 postgres: port 5432, wangchunling dispatch [local] con12 cmd6 idle [local]
501 30235 30216 0 2:35PM ?? 0:00.13 postgres: port 4, wangchunling dispatch 127.0.0.1(65051) con12 seg0 idle
501 30239 30216 0 2:35PM ?? 0:00.06 postgres: port 4, wangchunling dispatch 127.0.0.1(65061) con12 seg0 idle
501 30240 30216 0 2:35PM ?? 0:00.06 postgres: port 4, wangchunling dispatch 127.0.0.1(65063) con12 seg0 idle
501 30242 99560 0 2:36PM ttys000 0:00.00 grep postgres
{code}
2. Kill one QE; afterwards no QE processes remain.
{code}
$ kill -9 30235
$ ps -ef|grep postgres
501 30190 1 0 2:34PM ?? 0:00.32 /usr/local/hawq/bin/postgres -D /Users/wangchunling/hawq-data-directory/masterdd -i -M master -p 5432 --silent-mode=true
501 30191 30190 0 2:34PM ?? 0:00.01 postgres: port 5432, master logger process
501 30194 30190 0 2:34PM ?? 0:00.00 postgres: port 5432, stats collector process
501 30195 30190 0 2:34PM ?? 0:00.01 postgres: port 5432, writer process
501 30196 30190 0 2:34PM ?? 0:00.00 postgres: port 5432, checkpoint process
501 30197 30190 0 2:34PM ?? 0:00.00 postgres: port 5432, seqserver process
501 30198 30190 0 2:34PM ?? 0:00.00 postgres: port 5432, WAL Send Server process
501 30199 30190 0 2:34PM ?? 0:00.00 postgres: port 5432, DFS Metadata Cache process
501 30200 30190 0 2:34PM ?? 0:00.08 postgres: port 5432, master resource manager
501 30216 1 0 2:34PM ?? 0:00.58 /usr/local/hawq/bin/postgres -D /Users/wangchunling/hawq-data-directory/segmentdd -i -M segment -p 4 --silent-mode=true
501 30217 30216 0 2:34PM ?? 0:00.03 postgres: port 4, logger process
501 30231 30190 0 2:35PM ?? 0:00.04 postgres: port 5432, wangchunling dispatch [local] con12 cmd6 idle [local]
501 30248 30216 0 2:36PM ?? 0:00.00 postgres: port 4, stats collector process
501 30249 30216 0 2:36PM ?? 0:00.00 postgres: port 4, writer process
501 30250 30216 0 2:36PM ?? 0:00.00 postgres: port 4, checkpoint process
501 30251 30216 0 2:36PM ?? 0:00.00 postgres: port 4, segment resource manager
501 30255 99560 0 2:36PM ttys000 0:00.00 grep postgres
{code}
3. Run the query again to get some new QEs.
{code}
dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
 count
-------
  3725
(1 row)
{code}
{code}
$ ps -ef|grep postgres
501 30190 1 0 2:34PM ?? 0:00.33 /usr/local/hawq/bin/postgres -D /Users/wangchunling/hawq-data-directory/masterdd -i -M master -p 5432 --silent-mode=true
501 30191 30190 0 2:34PM ?? 0:00.01
[jira] [Updated] (HAWQ-564) QD hangs when connecting to resource manager
[ https://issues.apache.org/jira/browse/HAWQ-564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chunling Wang updated HAWQ-564:

    Description:
We first inject a panic into a QE process and run a query, which brings the segment down. After the segment comes back up, we run another query and get the correct answer. We then inject the same panic a second time. After the segment goes down and comes back up again, we run a query and find that the QD process hangs when connecting to the resource manager. Here is the backtrace when the QD hangs:
{code}
* thread #1: tid = 0x21d8be, 0x7fff890355be libsystem_kernel.dylib`poll + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10
    frame #1: 0x000101daeafe postgres`processAllCommFileDescs + 158 at rmcomm_AsyncComm.c:156
    frame #2: 0x000101db85f5 postgres`callSyncRPCRemote(hostname=0x7f9c19e00cd0, port=5437, sendbuff=0x7f9c1b918f50, sendbuffsize=80, sendmsgid=259, exprecvmsgid=2307, recvsmb=, errorbuf=0x00010230c1a0, errorbufsize=) + 645 at rmcomm_SyncComm.c:122
    frame #3: 0x000101db2d85 postgres`acquireResourceFromRM [inlined] callSyncRPCToRM(sendbuff=0x7f9c1b918f50, sendbuffsize=, sendmsgid=259, exprecvmsgid=2307, recvsmb=0x7f9c1b918e70, errorbuf=, errorbufsize=1024) + 73 at rmcomm_QD2RM.c:2780
    frame #4: 0x000101db2d3c postgres`acquireResourceFromRM(index=, sessionid=12, slice_size=462524016, iobytes=134217728, preferred_nodes=0x7f9c1a02d398, preferred_nodes_size=, max_seg_count_fix=, min_seg_count_fix=, errorbuf=, errorbufsize=) + 572 at rmcomm_QD2RM.c:742
    frame #5: 0x000101c979e7 postgres`AllocateResource(life=QRL_ONCE, slice_size=5, iobytes=134217728, max_target_segment_num=1, min_target_segment_num=1, vol_info=0x7f9c1a02d398, vol_info_size=1) + 631 at pquery.c:796
    frame #6: 0x000101e8c60f postgres`calculate_planner_segment_num(query=, resourceLife=QRL_ONCE, fullRangeTable=, intoPolicy=, sliceNum=5) + 14287 at cdbdatalocality.c:4207
    frame #7: 0x000101c0f671 postgres`planner + 106 at planner.c:496
    frame #8: 0x000101c0f607 postgres`planner(parse=0x7f9c1a02a140, cursorOptions=, boundParams=0x, resourceLife=QRL_ONCE) + 311 at planner.c:310
    frame #9: 0x000101c8eb33 postgres`pg_plan_query(querytree=0x7f9c1a02a140, boundParams=0x, resource_life=QRL_ONCE) + 99 at postgres.c:837
    frame #10: 0x000101c956ae postgres`exec_simple_query + 21 at postgres.c:911
    frame #11: 0x000101c95699 postgres`exec_simple_query(query_string=0x7f9c1a028a30, seqServerHost=0x, seqServerPort=-1) + 1577 at postgres.c:1671
    frame #12: 0x000101c91a4c postgres`PostgresMain(argc=, argv=, username=0x7f9c1b808cf0) + 9404 at postgres.c:4754
    frame #13: 0x000101c4ae02 postgres`ServerLoop [inlined] BackendRun + 105 at postmaster.c:5889
    frame #14: 0x000101c4ad99 postgres`ServerLoop at postmaster.c:5484
    frame #15: 0x000101c4ad99 postgres`ServerLoop + 9593 at postmaster.c:2163
    frame #16: 0x000101c47d3b postgres`PostmasterMain(argc=, argv=) + 5019 at postmaster.c:1454
    frame #17: 0x000101bb1aa9 postgres`main(argc=9, argv=0x7f9c19c1eef0) + 1433 at main.c:209
    frame #18: 0x7fff95e8c5c9 libdyld.dylib`start + 1
  thread #2: tid = 0x21d8bf, 0x7fff890355be libsystem_kernel.dylib`poll + 10
    frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10
    frame #1: 0x000101dfe723 postgres`rxThreadFunc(arg=) + 2163 at ic_udp.c:6251
    frame #2: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
    frame #3: 0x7fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
    frame #4: 0x7fff95e804b1 libsystem_pthread.dylib`thread_start + 13
  thread #3: tid = 0x21d9c2, 0x7fff890343f6 libsystem_kernel.dylib`__select + 10
    frame #0: 0x7fff890343f6 libsystem_kernel.dylib`__select + 10
    frame #1: 0x000101e9d42e postgres`pg_usleep(microsec=) + 78 at pgsleep.c:43
    frame #2: 0x000101db1a66 postgres`generateResourceRefreshHeartBeat(arg=0x7f9c19f02480) + 166 at rmcomm_QD2RM.c:1519
    frame #3: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
    frame #4: 0x7fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
    frame #5: 0x7fff95e804b1 libsystem_pthread.dylib`thread_start + 13
{code}
Here are the steps:
1. Before injection, the query returns the correct answer.
{code}
dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
 count
-------
  3725
(1 row)
{code}
2. Inject the panic; the fault is triggered and the segment goes down.
{code}
dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
ERROR: fault triggered, fault name:'fail_qe_whe
[jira] [Updated] (HAWQ-564) QD hangs when connecting to resource manager
[ https://issues.apache.org/jira/browse/HAWQ-564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chunling Wang updated HAWQ-564:

    Affects Version/s: 2.0.0

> QD hangs when connecting to resource manager
>
>                 Key: HAWQ-564
>                 URL: https://issues.apache.org/jira/browse/HAWQ-564
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: Resource Manager
>    Affects Versions: 2.0.0
>            Reporter: Chunling Wang
>            Assignee: Lei Chang
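
Note: the HAWQ-564 backtrace shows the QD main thread blocked in poll() underneath callSyncRPCRemote while it waits for the resource manager. As a generic illustration of how a synchronous wait can be bounded so that a silent peer produces an error instead of a hang, here is a small Python sketch using select() with a deadline. It is not a proposed patch and says nothing about the root cause, which the report does not establish; wait_readable is a hypothetical helper name.

```python
import select
import socket
import time


def wait_readable(sock, timeout_s):
    """Wait until sock has data to read or timeout_s seconds have passed.

    Returns True if the socket became readable, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # The peer stayed silent for the whole budget: report a timeout
            # to the caller instead of blocking forever.
            return False
        readable, _, _ = select.select([sock], [], [], remaining)
        if readable:
            return True
```

A caller in this style would turn a False result into an error message for the client rather than waiting indefinitely for a reply.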
[jira] [Updated] (HAWQ-564) QD hangs when connecting to resource manager
[ https://issues.apache.org/jira/browse/HAWQ-564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chunling Wang updated HAWQ-564: --- Description: When first inject panic in QE process, we run a query and segment is down. After the segment is up, we run another query and get correct answer. Then we inject the same panic second time. After the segment is down and then up again, we run a query and find QD process hangs when connecting to resource manager. Here is the backtrace when QD hangs: {code} * thread #1: tid = 0x21d8be, 0x7fff890355be libsystem_kernel.dylib`poll + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP * frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10 frame #1: 0x000101daeafe postgres`processAllCommFileDescs + 158 at rmcomm_AsyncComm.c:156 frame #2: 0x000101db85f5 postgres`callSyncRPCRemote(hostname=0x7f9c19e00cd0, port=5437, sendbuff=0x7f9c1b918f50, sendbuffsize=80, sendmsgid=259, exprecvmsgid=2307, recvsmb=, errorbuf=0x00010230c1a0, errorbufsize=) + 645 at rmcomm_SyncComm.c:122 frame #3: 0x000101db2d85 postgres`acquireResourceFromRM [inlined] callSyncRPCToRM(sendbuff=0x7f9c1b918f50, sendbuffsize=, sendmsgid=259, exprecvmsgid=2307, recvsmb=0x7f9c1b918e70, errorbuf=, errorbufsize=1024) + 73 at rmcomm_QD2RM.c:2780 frame #4: 0x000101db2d3c postgres`acquireResourceFromRM(index=, sessionid=12, slice_size=462524016, iobytes=134217728, preferred_nodes=0x7f9c1a02d398, preferred_nodes_size=, max_seg_count_fix=, min_seg_count_fix=, errorbuf=, errorbufsize=) + 572 at rmcomm_QD2RM.c:742 frame #5: 0x000101c979e7 postgres`AllocateResource(life=QRL_ONCE, slice_size=5, iobytes=134217728, max_target_segment_num=1, min_target_segment_num=1, vol_info=0x7f9c1a02d398, vol_info_size=1) + 631 at pquery.c:796 frame #6: 0x000101e8c60f postgres`calculate_planner_segment_num(query=, resourceLife=QRL_ONCE, fullRangeTable=, intoPolicy=, sliceNum=5) + 14287 at cdbdatalocality.c:4207 frame #7: 0x000101c0f671 postgres`planner + 106 at 
planner.c:496 frame #8: 0x000101c0f607 postgres`planner(parse=0x7f9c1a02a140, cursorOptions=, boundParams=0x, resourceLife=QRL_ONCE) + 311 at planner.c:310 frame #9: 0x000101c8eb33 postgres`pg_plan_query(querytree=0x7f9c1a02a140, boundParams=0x, resource_life=QRL_ONCE) + 99 at postgres.c:837 frame #10: 0x000101c956ae postgres`exec_simple_query + 21 at postgres.c:911 frame #11: 0x000101c95699 postgres`exec_simple_query(query_string=0x7f9c1a028a30, seqServerHost=0x, seqServerPort=-1) + 1577 at postgres.c:1671 frame #12: 0x000101c91a4c postgres`PostgresMain(argc=, argv=, username=0x7f9c1b808cf0) + 9404 at postgres.c:4754 frame #13: 0x000101c4ae02 postgres`ServerLoop [inlined] BackendRun + 105 at postmaster.c:5889 frame #14: 0x000101c4ad99 postgres`ServerLoop at postmaster.c:5484 frame #15: 0x000101c4ad99 postgres`ServerLoop + 9593 at postmaster.c:2163 frame #16: 0x000101c47d3b postgres`PostmasterMain(argc=, argv=) + 5019 at postmaster.c:1454 frame #17: 0x000101bb1aa9 postgres`main(argc=9, argv=0x7f9c19c1eef0) + 1433 at main.c:209 frame #18: 0x7fff95e8c5c9 libdyld.dylib`start + 1 thread #2: tid = 0x21d8bf, 0x7fff890355be libsystem_kernel.dylib`poll + 10 frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10 frame #1: 0x000101dfe723 postgres`rxThreadFunc(arg=) + 2163 at ic_udp.c:6251 frame #2: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131 frame #3: 0x7fff95e82279 libsystem_pthread.dylib`_pthread_start + 176 frame #4: 0x7fff95e804b1 libsystem_pthread.dylib`thread_start + 13 thread #3: tid = 0x21d9c2, 0x7fff890343f6 libsystem_kernel.dylib`__select + 10 frame #0: 0x7fff890343f6 libsystem_kernel.dylib`__select + 10 frame #1: 0x000101e9d42e postgres`pg_usleep(microsec=) + 78 at pgsleep.c:43 frame #2: 0x000101db1a66 postgres`generateResourceRefreshHeartBeat(arg=0x7f9c19f02480) + 166 at rmcomm_QD2RM.c:1519 frame #3: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131 frame #4: 0x7fff95e82279 libsystem_pthread.dylib`_pthread_start + 176 frame #5: 
0x7fff95e804b1 libsystem_pthread.dylib`thread_start + 13 {code} And here is the operations: {code} dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id; count --- 3725 (1 row) dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id; ERROR: fault triggered, fault name:'fail_qe_when_do_query' fault type:'panic' (faultinjector.c:656) (seg0 localhost:4 pid=26936) dispatch=# select count(*) fr
[jira] [Created] (HAWQ-564) QD hangs when connecting to resource manager
Chunling Wang created HAWQ-564:
----------------------------------

    Summary: QD hangs when connecting to resource manager
    Key: HAWQ-564
    URL: https://issues.apache.org/jira/browse/HAWQ-564
    Project: Apache HAWQ
    Issue Type: Bug
    Components: Resource Manager
    Reporter: Chunling Wang
    Assignee: Lei Chang

After we inject a panic into the QE process for the first time, we run a query and the segment goes down. After the segment comes back up, we run another query and get the correct answer. Then we inject the same panic a second time. After the segment goes down and comes back up again, we run a query and find that the QD process hangs when connecting to the resource manager. Here is the backtrace when the QD hangs:
{code}
* thread #1: tid = 0x21d8be, 0x7fff890355be libsystem_kernel.dylib`poll + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10
    frame #1: 0x000101daeafe postgres`processAllCommFileDescs + 158 at rmcomm_AsyncComm.c:156
    frame #2: 0x000101db85f5 postgres`callSyncRPCRemote(hostname=0x7f9c19e00cd0, port=5437, sendbuff=0x7f9c1b918f50, sendbuffsize=80, sendmsgid=259, exprecvmsgid=2307, recvsmb=, errorbuf=0x00010230c1a0, errorbufsize=) + 645 at rmcomm_SyncComm.c:122
    frame #3: 0x000101db2d85 postgres`acquireResourceFromRM [inlined] callSyncRPCToRM(sendbuff=0x7f9c1b918f50, sendbuffsize=, sendmsgid=259, exprecvmsgid=2307, recvsmb=0x7f9c1b918e70, errorbuf=, errorbufsize=1024) + 73 at rmcomm_QD2RM.c:2780
    frame #4: 0x000101db2d3c postgres`acquireResourceFromRM(index=, sessionid=12, slice_size=462524016, iobytes=134217728, preferred_nodes=0x7f9c1a02d398, preferred_nodes_size=, max_seg_count_fix=, min_seg_count_fix=, errorbuf=, errorbufsize=) + 572 at rmcomm_QD2RM.c:742
    frame #5: 0x000101c979e7 postgres`AllocateResource(life=QRL_ONCE, slice_size=5, iobytes=134217728, max_target_segment_num=1, min_target_segment_num=1, vol_info=0x7f9c1a02d398, vol_info_size=1) + 631 at pquery.c:796
    frame #6: 0x000101e8c60f postgres`calculate_planner_segment_num(query=, resourceLife=QRL_ONCE, fullRangeTable=, 
intoPolicy=, sliceNum=5) + 14287 at cdbdatalocality.c:4207 frame #7: 0x000101c0f671 postgres`planner + 106 at planner.c:496 frame #8: 0x000101c0f607 postgres`planner(parse=0x7f9c1a02a140, cursorOptions=, boundParams=0x, resourceLife=QRL_ONCE) + 311 at planner.c:310 frame #9: 0x000101c8eb33 postgres`pg_plan_query(querytree=0x7f9c1a02a140, boundParams=0x, resource_life=QRL_ONCE) + 99 at postgres.c:837 frame #10: 0x000101c956ae postgres`exec_simple_query + 21 at postgres.c:911 frame #11: 0x000101c95699 postgres`exec_simple_query(query_string=0x7f9c1a028a30, seqServerHost=0x, seqServerPort=-1) + 1577 at postgres.c:1671 frame #12: 0x000101c91a4c postgres`PostgresMain(argc=, argv=, username=0x7f9c1b808cf0) + 9404 at postgres.c:4754 frame #13: 0x000101c4ae02 postgres`ServerLoop [inlined] BackendRun + 105 at postmaster.c:5889 frame #14: 0x000101c4ad99 postgres`ServerLoop at postmaster.c:5484 frame #15: 0x000101c4ad99 postgres`ServerLoop + 9593 at postmaster.c:2163 frame #16: 0x000101c47d3b postgres`PostmasterMain(argc=, argv=) + 5019 at postmaster.c:1454 frame #17: 0x000101bb1aa9 postgres`main(argc=9, argv=0x7f9c19c1eef0) + 1433 at main.c:209 frame #18: 0x7fff95e8c5c9 libdyld.dylib`start + 1 thread #2: tid = 0x21d8bf, 0x7fff890355be libsystem_kernel.dylib`poll + 10 frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10 frame #1: 0x000101dfe723 postgres`rxThreadFunc(arg=) + 2163 at ic_udp.c:6251 frame #2: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131 frame #3: 0x7fff95e82279 libsystem_pthread.dylib`_pthread_start + 176 frame #4: 0x7fff95e804b1 libsystem_pthread.dylib`thread_start + 13 thread #3: tid = 0x21d9c2, 0x7fff890343f6 libsystem_kernel.dylib`__select + 10 frame #0: 0x7fff890343f6 libsystem_kernel.dylib`__select + 10 frame #1: 0x000101e9d42e postgres`pg_usleep(microsec=) + 78 at pgsleep.c:43 frame #2: 0x000101db1a66 postgres`generateResourceRefreshHeartBeat(arg=0x7f9c19f02480) + 166 at rmcomm_QD2RM.c:1519 frame #3: 0x7fff95e822fc 
libsystem_pthread.dylib`_pthread_body + 131 frame #4: 0x7fff95e82279 libsystem_pthread.dylib`_pthread_start + 176 frame #5: 0x7fff95e804b1 libsystem_pthread.dylib`thread_start + 13 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
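The backtrace above shows the main thread blocked in poll() underneath callSyncRPCRemote(). The report does not state the root cause, so the following is only a generic defensive sketch, not HAWQ code: a wait loop that polls in short slices and gives up after an overall deadline instead of waiting indefinitely. The function name wait_readable_with_deadline() and its parameters are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of HAWQ): wait until fd becomes readable,
 * polling in slices of slice_ms and giving up after max_wait_ms in total.
 * Returns 1 if readable, 0 on timeout, -1 on error or peer hang-up.
 */
int wait_readable_with_deadline(int fd, int max_wait_ms, int slice_ms)
{
    int waited_ms = 0;

    assert(fd >= 0);
    if (slice_ms <= 0)
        slice_ms = 1;           /* avoid a loop that never advances */

    while (waited_ms < max_wait_ms)
    {
        struct pollfd pfd;
        int rc;

        pfd.fd = fd;
        pfd.events = POLLIN;
        pfd.revents = 0;

        rc = poll(&pfd, 1, slice_ms);
        if (rc < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry, time not counted */
            return -1;
        }
        if (rc == 0)
        {
            waited_ms += slice_ms;   /* this slice expired with no event */
            continue;
        }
        if (pfd.revents & POLLIN)
            return 1;                /* data is available */
        if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL))
            return -1;               /* peer is gone or fd is invalid */
    }
    return 0;                        /* overall deadline reached */
}
```

A caller could treat a return value of 0 as "resource manager not responding" and report an error to the client rather than hanging. Whether that is the right behaviour for the QD is a design decision this sketch does not settle.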
[jira] [Created] (HAWQ-559) QD hangs when QE is killed after connected to QD
Chunling Wang created HAWQ-559: -- Summary: QD hangs when QE is killed after connected to QD Key: HAWQ-559 URL: https://issues.apache.org/jira/browse/HAWQ-559 Project: Apache HAWQ Issue Type: Bug Components: Dispatcher Reporter: Chunling Wang Assignee: Lei Chang When the first query finishes, the QE is still alive. Then we run the second query. After the thread of QD is created and bind to QE but not send data to QE, we kill this QE and find QD hangs. Here is the backtrace when QD hangs: * thread #1: tid = 0x1c4afd, 0x7fff890355be libsystem_kernel.dylib`poll + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP * frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10 frame #1: 0x00010745692c postgres`receiveChunksUDP [inlined] udpSignalPoll + 42 at ic_udp.c:2882 frame #2: 0x000107456902 postgres`receiveChunksUDP + 26 at ic_udp.c:2715 frame #3: 0x0001074568e8 postgres`receiveChunksUDP [inlined] waitOnCondition(timeout_us=25) + 82 at ic_udp.c:1599 frame #4: 0x000107456896 postgres`receiveChunksUDP(pTransportStates=0x7ff2a381ae48, pEntry=0x7ff2a18f2230, motNodeID=, srcRoute=0x7fff58c0ce96, conn=, inTeardown='\0') + 726 at ic_udp.c:4039 frame #5: 0x000107452a86 postgres`RecvTupleChunkFromAnyUDP [inlined] RecvTupleChunkFromAnyUDP_Internal + 498 at ic_udp.c:4146 frame #6: 0x000107452894 postgres`RecvTupleChunkFromAnyUDP(mlStates=, transportStates=, motNodeID=1, srcRoute=0x7fff58c0ce96) + 100 at ic_udp.c:4167 frame #7: 0x000107442254 postgres`RecvTupleFrom [inlined] processIncomingChunks(mlStates=0x7ff2a3812a30, transportStates=0x7ff2a381ae48, motNodeID=1, srcRoute=) + 34 at cdbmotion.c:684 frame #8: 0x000107442232 postgres`RecvTupleFrom(mlStates=0x7ff2a3812a30, transportStates=, motNodeID=1, tup_i=0x7fff58c0cf00, srcRoute=-100) + 370 at cdbmotion.c:610 frame #9: 0x0001071c8778 postgres`ExecMotion [inlined] execMotionUnsortedReceiver(node=) + 57 at nodeMotion.c:466 frame #10: 0x0001071c873f postgres`ExecMotion(node=) + 1071 at nodeMotion.c:298 frame #11: 
0x0001071a4835 postgres`ExecProcNode(node=0x7ff2a38164b8) + 613 at execProcnode.c:999 frame #12: 0x0001071b9f82 postgres`ExecAgg + 104 at nodeAgg.c:1163 frame #13: 0x0001071b9f1a postgres`ExecAgg + 316 at nodeAgg.c:1693 frame #14: 0x0001071b9dde postgres`ExecAgg(node=0x7ff2a3815348) + 126 at nodeAgg.c:1138 frame #15: 0x0001071a4803 postgres`ExecProcNode(node=0x7ff2a3815348) + 563 at execProcnode.c:979 frame #16: 0x00010719ecfd postgres`ExecutePlan(estate=0x7ff2a3814e30, planstate=0x7ff2a3815348, operation=CMD_SELECT, numberTuples=0, direction=, dest=0x7ff2a28db178) + 1181 at execMain.c:3218 frame #17: 0x00010719e619 postgres`ExecutorRun(queryDesc=0x7ff2a3811f00, direction=ForwardScanDirection, count=0) + 569 at execMain.c:1213 frame #18: 0x0001072e7fc2 postgres`PortalRun + 14 at pquery.c:1649 frame #19: 0x0001072e7fb4 postgres`PortalRun(portal=0x7ff2a1893e30, count=, isTopLevel='\x01', dest=, altdest=0x7ff2a28db178, completionTag=0x7fff58c0d530) + 1124 at pquery.c:1471 frame #20: 0x0001072e4a8e postgres`exec_simple_query(query_string=0x7ff2a380fe30, seqServerHost=0x, seqServerPort=-1) + 2078 at postgres.c:1745 frame #21: 0x0001072e0c4c postgres`PostgresMain(argc=, argv=, username=0x7ff2a201bcf0) + 9404 at postgres.c:4754 frame #22: 0x00010729a002 postgres`ServerLoop [inlined] BackendRun + 105 at postmaster.c:5889 frame #23: 0x000107299f99 postgres`ServerLoop at postmaster.c:5484 frame #24: 0x000107299f99 postgres`ServerLoop + 9593 at postmaster.c:2163 frame #25: 0x000107296f3b postgres`PostmasterMain(argc=, argv=) + 5019 at postmaster.c:1454 frame #26: 0x000107200ca9 postgres`main(argc=9, argv=0x7ff2a141eef0) + 1433 at main.c:209 frame #27: 0x7fff95e8c5c9 libdyld.dylib`start + 1 thread #2: tid = 0x1c4afe, 0x7fff890355be libsystem_kernel.dylib`poll + 10 frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10 frame #1: 0x00010744d8e3 postgres`rxThreadFunc(arg=) + 2163 at ic_udp.c:6251 frame #2: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131 frame #3: 
0x7fff95e82279 libsystem_pthread.dylib`_pthread_start + 176 frame #4: 0x7fff95e804b1 libsystem_pthread.dylib`thread_start + 13 thread #3: tid = 0x1c4b02, 0x7fff890343f6 libsystem_kernel.dylib`__select + 10 frame #0: 0x7fff890343f6 libsystem_kernel.dylib`__select + 10 frame #1: 0x0001074ec47e postgres`pg_usleep(microsec=) + 78 at pgsleep.c:43 frame #2: 0x000107400c26 postgres`generateResourceRefreshHeartBeat(ar
[jira] [Updated] (HAWQ-559) QD hangs when QE is killed after connected to QD
[ https://issues.apache.org/jira/browse/HAWQ-559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chunling Wang updated HAWQ-559: --- Affects Version/s: 2.0.0 Environment: mac os X 10.10 > QD hangs when QE is killed after connected to QD > > > Key: HAWQ-559 > URL: https://issues.apache.org/jira/browse/HAWQ-559 > Project: Apache HAWQ > Issue Type: Bug > Components: Dispatcher >Affects Versions: 2.0.0 > Environment: mac os X 10.10 >Reporter: Chunling Wang >Assignee: Lei Chang > > When the first query finishes, the QE is still alive. Then we run the second > query. After the thread of QD is created and bind to QE but not send data to > QE, we kill this QE and find QD hangs. > Here is the backtrace when QD hangs: > * thread #1: tid = 0x1c4afd, 0x7fff890355be libsystem_kernel.dylib`poll + > 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP > * frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10 > frame #1: 0x00010745692c postgres`receiveChunksUDP [inlined] > udpSignalPoll + 42 at ic_udp.c:2882 > frame #2: 0x000107456902 postgres`receiveChunksUDP + 26 at > ic_udp.c:2715 > frame #3: 0x0001074568e8 postgres`receiveChunksUDP [inlined] > waitOnCondition(timeout_us=25) + 82 at ic_udp.c:1599 > frame #4: 0x000107456896 > postgres`receiveChunksUDP(pTransportStates=0x7ff2a381ae48, > pEntry=0x7ff2a18f2230, motNodeID=, > srcRoute=0x7fff58c0ce96, conn=, inTeardown='\0') + 726 at > ic_udp.c:4039 > frame #5: 0x000107452a86 postgres`RecvTupleChunkFromAnyUDP [inlined] > RecvTupleChunkFromAnyUDP_Internal + 498 at ic_udp.c:4146 > frame #6: 0x000107452894 > postgres`RecvTupleChunkFromAnyUDP(mlStates=, > transportStates=, motNodeID=1, srcRoute=0x7fff58c0ce96) + > 100 at ic_udp.c:4167 > frame #7: 0x000107442254 postgres`RecvTupleFrom [inlined] > processIncomingChunks(mlStates=0x7ff2a3812a30, > transportStates=0x7ff2a381ae48, motNodeID=1, srcRoute=) + 34 > at cdbmotion.c:684 > frame #8: 0x000107442232 > postgres`RecvTupleFrom(mlStates=0x7ff2a3812a30, > 
transportStates=, motNodeID=1, tup_i=0x7fff58c0cf00, > srcRoute=-100) + 370 at cdbmotion.c:610 > frame #9: 0x0001071c8778 postgres`ExecMotion [inlined] > execMotionUnsortedReceiver(node=) + 57 at nodeMotion.c:466 > frame #10: 0x0001071c873f postgres`ExecMotion(node=) + > 1071 at nodeMotion.c:298 > frame #11: 0x0001071a4835 > postgres`ExecProcNode(node=0x7ff2a38164b8) + 613 at execProcnode.c:999 > frame #12: 0x0001071b9f82 postgres`ExecAgg + 104 at nodeAgg.c:1163 > frame #13: 0x0001071b9f1a postgres`ExecAgg + 316 at nodeAgg.c:1693 > frame #14: 0x0001071b9dde postgres`ExecAgg(node=0x7ff2a3815348) + > 126 at nodeAgg.c:1138 > frame #15: 0x0001071a4803 > postgres`ExecProcNode(node=0x7ff2a3815348) + 563 at execProcnode.c:979 > frame #16: 0x00010719ecfd > postgres`ExecutePlan(estate=0x7ff2a3814e30, planstate=0x7ff2a3815348, > operation=CMD_SELECT, numberTuples=0, direction=, > dest=0x7ff2a28db178) + 1181 at execMain.c:3218 > frame #17: 0x00010719e619 > postgres`ExecutorRun(queryDesc=0x7ff2a3811f00, > direction=ForwardScanDirection, count=0) + 569 at execMain.c:1213 > frame #18: 0x0001072e7fc2 postgres`PortalRun + 14 at pquery.c:1649 > frame #19: 0x0001072e7fb4 > postgres`PortalRun(portal=0x7ff2a1893e30, count=, > isTopLevel='\x01', dest=, altdest=0x7ff2a28db178, > completionTag=0x7fff58c0d530) + 1124 at pquery.c:1471 > frame #20: 0x0001072e4a8e > postgres`exec_simple_query(query_string=0x7ff2a380fe30, > seqServerHost=0x, seqServerPort=-1) + 2078 at postgres.c:1745 > frame #21: 0x0001072e0c4c postgres`PostgresMain(argc=, > argv=, username=0x7ff2a201bcf0) + 9404 at postgres.c:4754 > frame #22: 0x00010729a002 postgres`ServerLoop [inlined] BackendRun + > 105 at postmaster.c:5889 > frame #23: 0x000107299f99 postgres`ServerLoop at postmaster.c:5484 > frame #24: 0x000107299f99 postgres`ServerLoop + 9593 at > postmaster.c:2163 > frame #25: 0x000107296f3b postgres`PostmasterMain(argc=, > argv=) + 5019 at postmaster.c:1454 > frame #26: 0x000107200ca9 postgres`main(argc=9, > 
argv=0x7ff2a141eef0) + 1433 at main.c:209 > frame #27: 0x7fff95e8c5c9 libdyld.dylib`start + 1 > thread #2: tid = 0x1c4afe, 0x7fff890355be libsystem_kernel.dylib`poll + > 10 > frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10 > frame #1: 0x00010744d8e3 postgres`rxThreadFunc(arg=) + > 2163 at ic_udp.c:6251 > frame #2: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131 > frame #3: 0x000
[jira] [Updated] (HAWQ-523) Dead code in executormgr_bind_executor_task()
[ https://issues.apache.org/jira/browse/HAWQ-523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chunling Wang updated HAWQ-523:
----------------------------------
    Summary: Dead code in executormgr_bind_executor_task() (was: dead code in executormgr_bind_executor_task())

> Dead code in executormgr_bind_executor_task()
> ---------------------------------------------
>
>     Key: HAWQ-523
>     URL: https://issues.apache.org/jira/browse/HAWQ-523
>     Project: Apache HAWQ
>     Issue Type: New Feature
>     Components: Dispatcher
>     Affects Versions: 2.0.0
>     Reporter: Chunling Wang
>     Assignee: Lei Chang
>
> In executormgr.c, the code below is never reached:
> bool
> executormgr_bind_executor_task(struct DispatchData *data,
>                                QueryExecutor *executor,
>                                SegmentDatabaseDescriptor *desc,
>                                struct DispatchTask *task,
>                                struct DispatchSlice *slice)
> {
>     ...
>     if (desc == NULL)
>     {
>         executor->health = QEH_ERROR;
>         return false;
>     }
>     ...
> }
[jira] [Updated] (HAWQ-524) Do not resolve the condition of 'executor->refResult = NULL' in executormgr_bind_executor_task()
[ https://issues.apache.org/jira/browse/HAWQ-524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chunling Wang updated HAWQ-524:
----------------------------------
    Summary: Do not resolve the condition of 'executor->refResult = NULL' in executormgr_bind_executor_task() (was: do not resolve the condition of 'executor->refResult = NULL' in executormgr_bind_executor_task())

> Do not resolve the condition of 'executor->refResult = NULL' in executormgr_bind_executor_task()
> ------------------------------------------------------------------------------------------------
>
>     Key: HAWQ-524
>     URL: https://issues.apache.org/jira/browse/HAWQ-524
>     Project: Apache HAWQ
>     Issue Type: Bug
>     Components: Dispatcher
>     Affects Versions: 2.0.0
>     Reporter: Chunling Wang
>     Assignee: Lili Ma
>     Fix For: 2.0.0
>
> In executormgr.c, the check below should not be an Assert(). The condition 'executor->refResult == NULL' should be caught and handled.
> bool
> executormgr_bind_executor_task(struct DispatchData *data,
>                                QueryExecutor *executor,
>                                SegmentDatabaseDescriptor *desc,
>                                struct DispatchTask *task,
>                                struct DispatchSlice *slice)
> {
>     ...
>     Assert(executor->refResult != NULL);
>     ...
> }
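To illustrate the change HAWQ-524 asks for, here is a minimal standalone sketch of replacing the assertion with an explicit check that marks the executor as failed and returns false. It is not the actual patch: the struct below is a simplified stand-in with only the two fields named in the report, QEH_INIT is a made-up placeholder value, and bind_executor_checked() is a hypothetical function name. The error handling mirrors the existing 'desc == NULL' branch quoted in HAWQ-523.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for illustration; the real HAWQ types have more members. */
typedef enum
{
    QEH_INIT,   /* hypothetical placeholder, not taken from the source */
    QEH_ERROR   /* appears in the quoted executormgr.c snippet */
} ExecutorHealth;

typedef struct QueryExecutor
{
    void          *refResult;
    ExecutorHealth health;
} QueryExecutor;

/*
 * Hypothetical variant of the bind step: instead of
 *     Assert(executor->refResult != NULL);
 * the NULL case is detected and reported to the caller.
 */
bool bind_executor_checked(QueryExecutor *executor)
{
    assert(executor != NULL);   /* a NULL executor is still a programming error */

    if (executor->refResult == NULL)
    {
        executor->health = QEH_ERROR;
        return false;
    }
    return true;
}
```

With this shape the caller sees a false return and can clean up, rather than relying on an assertion to catch the condition.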
[jira] [Updated] (HAWQ-539) Improve code coverage for dispatcher: connection_fail_after_gang_creation& create_cdb_dispath_result_object& dispmgt_concurrent_connect
[ https://issues.apache.org/jira/browse/HAWQ-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chunling Wang updated HAWQ-539:
----------------------------------
    Summary: Improve code coverage for dispatcher: connection_fail_after_gang_creation& create_cdb_dispath_result_object& dispmgt_concurrent_connect (was: Add fault injection for dispatcher: connection_fail_after_gang_creation& create_cdb_dispath_result_object& dispmgt_concurrent_connect)

> Improve code coverage for dispatcher: connection_fail_after_gang_creation& create_cdb_dispath_result_object& dispmgt_concurrent_connect
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>     Key: HAWQ-539
>     URL: https://issues.apache.org/jira/browse/HAWQ-539
>     Project: Apache HAWQ
>     Issue Type: Sub-task
>     Components: Dispatcher
>     Reporter: Chunling Wang
>     Assignee: Lei Chang
>
> Add the three fault injections below:
> 1. connection_fail_after_gang_creation
> In function dispatcher_bind_executor() of dispatcher.c, we inject faults before connection rebind.
> #ifdef FAULT_INJECTOR
>     FaultInjector_InjectFaultIfSet(
>         ConnectionFailAfterGangCreation,
>         DDLNotSpecified,
>         "",   // databaseName
>         "");  // tableName
> #endif
> 2. create_cdb_dispath_result_object
> In function cdbdisp_makeResult() of cdbdispatchresult.c, we inject out-of-memory before calling PQExpBufferBroken().
> #ifdef FAULT_INJECTOR
>     FaultInjector_InjectFaultIfSet(
>         CreateCdbDispathResultObject,
>         DDLNotSpecified,
>         "",   // databaseName
>         "");  // tableName
> #endif
> 3. worker_manager_submit_job
> Inject error in function workermgr_submit_job() of workermgr.c.
> #ifdef FAULT_INJECTOR
>     FaultInjector_InjectFaultIfSet(
>         WorkerManagerSubmitJob,
>         DDLNotSpecified,
>         "",   // databaseName
>         "");  // tableName
> #endif
[jira] [Created] (HAWQ-539) Add fault injection for dispatcher: connection_fail_after_gang_creation& create_cdb_dispath_result_object& dispmgt_concurrent_connect
Chunling Wang created HAWQ-539:
----------------------------------

    Summary: Add fault injection for dispatcher: connection_fail_after_gang_creation& create_cdb_dispath_result_object& dispmgt_concurrent_connect
    Key: HAWQ-539
    URL: https://issues.apache.org/jira/browse/HAWQ-539
    Project: Apache HAWQ
    Issue Type: Sub-task
    Components: Dispatcher
    Reporter: Chunling Wang
    Assignee: Lei Chang

Add the three fault injections below:
1. connection_fail_after_gang_creation
In function dispatcher_bind_executor() of dispatcher.c, we inject faults before connection rebind.
#ifdef FAULT_INJECTOR
    FaultInjector_InjectFaultIfSet(
        ConnectionFailAfterGangCreation,
        DDLNotSpecified,
        "",   // databaseName
        "");  // tableName
#endif
2. create_cdb_dispath_result_object
In function cdbdisp_makeResult() of cdbdispatchresult.c, we inject out-of-memory before calling PQExpBufferBroken().
#ifdef FAULT_INJECTOR
    FaultInjector_InjectFaultIfSet(
        CreateCdbDispathResultObject,
        DDLNotSpecified,
        "",   // databaseName
        "");  // tableName
#endif
3. worker_manager_submit_job
Inject error in function workermgr_submit_job() of workermgr.c.
#ifdef FAULT_INJECTOR
    FaultInjector_InjectFaultIfSet(
        WorkerManagerSubmitJob,
        DDLNotSpecified,
        "",   // databaseName
        "");  // tableName
#endif
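The three injection points above follow one pattern: a hook compiled in only when FAULT_INJECTOR is defined, which fires if a named fault has been armed. The sketch below shows that pattern in isolation so it can be compiled and run on its own. It does not reproduce HAWQ's fault injector: fault_arm(), fault_is_set(), and submit_job_sketch() are hypothetical names, and the real FaultInjector_InjectFaultIfSet() takes the four arguments shown in the issue rather than a string.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Normally supplied by the build; defined here so the sketch compiles standalone. */
#define FAULT_INJECTOR 1

/* Name of the currently armed fault, or NULL when nothing is armed (hypothetical state). */
static const char *armed_fault = NULL;

/* Arm a fault by name; pass NULL to disarm. */
void fault_arm(const char *name)
{
    armed_fault = name;
}

/* Return true if the fault with the given name is currently armed. */
bool fault_is_set(const char *name)
{
    assert(name != NULL);
    return armed_fault != NULL && strcmp(armed_fault, name) == 0;
}

/*
 * Stand-in for a function such as workermgr_submit_job(): returns 0 on
 * success and -1 when the injected fault fires.
 */
int submit_job_sketch(void)
{
#ifdef FAULT_INJECTOR
    if (fault_is_set("worker_manager_submit_job"))
        return -1;              /* simulate the error path under test */
#endif
    return 0;                   /* normal path */
}
```

A test can arm the fault, call the function, and check that the error path is taken. Arming a different fault name leaves this call site unaffected, which is what lets each injection point exercise one code path at a time.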
[jira] [Created] (HAWQ-538) Add fault injection for dispatcher
Chunling Wang created HAWQ-538:
----------------------------------

    Summary: Add fault injection for dispatcher
    Key: HAWQ-538
    URL: https://issues.apache.org/jira/browse/HAWQ-538
    Project: Apache HAWQ
    Issue Type: New Feature
    Components: Dispatcher
    Reporter: Chunling Wang
    Assignee: Lei Chang
[jira] [Commented] (HAWQ-524) do not resolve the condition of 'executor->refResult = NULL' in executormgr_bind_executor_task()
[ https://issues.apache.org/jira/browse/HAWQ-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192821#comment-15192821 ]

Chunling Wang commented on HAWQ-524:
------------------------------------

In cdbdispatchresult.c, when dispatchResult->resultbuf == NULL, there is no need to free the PGresult objects again in function cdbdisp_resetResult(). Change the code as below:

void
cdbdisp_resetResult(CdbDispatchResult *dispatchResult)
{
    if (dispatchResult->resultbuf != NULL)
    {
        PQExpBuffer buf = dispatchResult->resultbuf;
        PGresult  **begp = (PGresult **) buf->data;
        PGresult  **endp = (PGresult **) (buf->data + buf->len);
        PGresult  **p;

        /* Free the PGresult objects. */
        for (p = begp; p < endp; ++p)
        {
            Assert(*p != NULL);
            PQclear(*p);
        }
    }
    ...
}

> do not resolve the condition of 'executor->refResult = NULL' in executormgr_bind_executor_task()
> ------------------------------------------------------------------------------------------------
>
>     Key: HAWQ-524
>     URL: https://issues.apache.org/jira/browse/HAWQ-524
>     Project: Apache HAWQ
>     Issue Type: Bug
>     Components: Dispatcher
>     Affects Versions: 2.0.0
>     Reporter: Chunling Wang
>     Assignee: Lei Chang
>
> In executormgr.c, the check below should not be an Assert(). The condition 'executor->refResult == NULL' should be caught and handled.
> bool
> executormgr_bind_executor_task(struct DispatchData *data,
>                                QueryExecutor *executor,
>                                SegmentDatabaseDescriptor *desc,
>                                struct DispatchTask *task,
>                                struct DispatchSlice *slice)
> {
>     ...
>     Assert(executor->refResult != NULL);
>     ...
> }
[jira] [Updated] (HAWQ-524) do not resolve the condition of 'executor->refResult = NULL' in executormgr_bind_executor_task()
[ https://issues.apache.org/jira/browse/HAWQ-524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chunling Wang updated HAWQ-524:
-------------------------------
    Affects Version/s: 2.0.0

> do not resolve the condition of 'executor->refResult = NULL' in executormgr_bind_executor_task()
> -------------------------------------------------------------------------------------------------
>
>                 Key: HAWQ-524
>                 URL: https://issues.apache.org/jira/browse/HAWQ-524
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: Dispatcher
>    Affects Versions: 2.0.0
>            Reporter: Chunling Wang
>            Assignee: Lei Chang
>
> In executormgr.c, the check below should not be an Assert(). The condition
> 'executor->refResult == NULL' should be caught.
>
> bool
> executormgr_bind_executor_task(struct DispatchData *data,
>                                QueryExecutor *executor,
>                                SegmentDatabaseDescriptor *desc,
>                                struct DispatchTask *task,
>                                struct DispatchSlice *slice)
> {
>     ...
>     Assert(executor->refResult != NULL);
>     ...
> }
[jira] [Created] (HAWQ-524) do not resolve the condition of 'executor->refResult = NULL' in executormgr_bind_executor_task()
Chunling Wang created HAWQ-524:
----------------------------------

             Summary: do not resolve the condition of 'executor->refResult = NULL' in executormgr_bind_executor_task()
                 Key: HAWQ-524
                 URL: https://issues.apache.org/jira/browse/HAWQ-524
             Project: Apache HAWQ
          Issue Type: Bug
          Components: Dispatcher
            Reporter: Chunling Wang
            Assignee: Lei Chang

In executormgr.c, the check below should not be an Assert(). The condition
'executor->refResult == NULL' should be caught.

    bool
    executormgr_bind_executor_task(struct DispatchData *data,
                                   QueryExecutor *executor,
                                   SegmentDatabaseDescriptor *desc,
                                   struct DispatchTask *task,
                                   struct DispatchSlice *slice)
    {
        ...
        Assert(executor->refResult != NULL);
        ...
    }
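One possible reading of "should be caught" is an explicit run-time check that marks the executor as failed and returns false, mirroring the 'desc == NULL' branch quoted in HAWQ-523. The issue does not say what the fix should look like, so the following is only a sketch under that assumption. SketchExecutor, SketchHealth, bind_executor_task_sketch() and demo_bind() are hypothetical names invented for illustration; the real QueryExecutor and executormgr_bind_executor_task() have more fields and parameters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the executor health states. */
typedef enum { SKETCH_QEH_OK, SKETCH_QEH_ERROR } SketchHealth;

/* Hypothetical, much reduced stand-in for QueryExecutor. */
typedef struct {
    void        *refResult;   /* slot for this executor's result, may be NULL */
    SketchHealth health;
    bool         bound;
} SketchExecutor;

/*
 * Bind a result slot to the executor. A missing slot is reported to the
 * caller through the return value instead of being asserted on, so the
 * check does not depend on assertions being compiled in.
 */
bool bind_executor_task_sketch(SketchExecutor *executor, void *result_slot)
{
    executor->refResult = result_slot;

    if (executor->refResult == NULL) {
        executor->health = SKETCH_QEH_ERROR;
        return false;
    }

    executor->bound = true;
    return true;
}

/* Usage: returns 1 if bound, 0 if rejected with the error state set. */
int demo_bind(bool have_result)
{
    int slot = 0;
    SketchExecutor ex = { NULL, SKETCH_QEH_OK, false };
    bool ok = bind_executor_task_sketch(&ex, have_result ? (void *) &slot : NULL);

    if (ok)
        return ex.bound ? 1 : -1;
    return ex.health == SKETCH_QEH_ERROR ? 0 : -1;
}
```

Returning false lets the caller decide how to recover, whereas an assertion either aborts the process or, when assertions are disabled, lets execution continue with a NULL pointer.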
[jira] [Updated] (HAWQ-523) dead code in executormgr_bind_executor_task()
[ https://issues.apache.org/jira/browse/HAWQ-523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chunling Wang updated HAWQ-523:
-------------------------------
    Affects Version/s: 2.0.0

> dead code in executormgr_bind_executor_task()
> ---------------------------------------------
>
>                 Key: HAWQ-523
>                 URL: https://issues.apache.org/jira/browse/HAWQ-523
>             Project: Apache HAWQ
>          Issue Type: New Feature
>          Components: Dispatcher
>    Affects Versions: 2.0.0
>            Reporter: Chunling Wang
>            Assignee: Lei Chang
>
> In executormgr.c, the code below is never reached:
>
> bool
> executormgr_bind_executor_task(struct DispatchData *data,
>                                QueryExecutor *executor,
>                                SegmentDatabaseDescriptor *desc,
>                                struct DispatchTask *task,
>                                struct DispatchSlice *slice)
> {
>     ...
>     if (desc == NULL)
>     {
>         executor->health = QEH_ERROR;
>         return false;
>     }
>     ...
> }
[jira] [Created] (HAWQ-523) dead code in executormgr_bind_executor_task()
Chunling Wang created HAWQ-523:
----------------------------------

             Summary: dead code in executormgr_bind_executor_task()
                 Key: HAWQ-523
                 URL: https://issues.apache.org/jira/browse/HAWQ-523
             Project: Apache HAWQ
          Issue Type: New Feature
          Components: Dispatcher
            Reporter: Chunling Wang
            Assignee: Lei Chang

In executormgr.c, the code below is never reached:

    bool
    executormgr_bind_executor_task(struct DispatchData *data,
                                   QueryExecutor *executor,
                                   SegmentDatabaseDescriptor *desc,
                                   struct DispatchTask *task,
                                   struct DispatchSlice *slice)
    {
        ...
        if (desc == NULL)
        {
            executor->health = QEH_ERROR;
            return false;
        }
        ...
    }