[jira] [Updated] (HAWQ-564) QD hangs when connecting to resource manager

2016-03-20 Thread Chunling Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chunling Wang updated HAWQ-564:
---
Description: 
We first inject a panic into a QE process and run a query, which brings the 
segment down. After the segment comes back up, we run another query and get the 
correct answer. Then we inject the same panic a second time. After the segment 
goes down and comes back up again, we run a query and find that the QD process 
hangs when connecting to the resource manager. 
Here is the backtrace when the QD hangs:
{code}
* thread #1: tid = 0x21d8be, 0x7fff890355be libsystem_kernel.dylib`poll + 
10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10
frame #1: 0x000101daeafe postgres`processAllCommFileDescs + 158 at 
rmcomm_AsyncComm.c:156
frame #2: 0x000101db85f5 
postgres`callSyncRPCRemote(hostname=0x7f9c19e00cd0, port=5437, 
sendbuff=0x7f9c1b918f50, sendbuffsize=80, sendmsgid=259, exprecvmsgid=2307, 
recvsmb=, errorbuf=0x00010230c1a0, errorbufsize=) 
+ 645 at rmcomm_SyncComm.c:122
frame #3: 0x000101db2d85 postgres`acquireResourceFromRM [inlined] 
callSyncRPCToRM(sendbuff=0x7f9c1b918f50, sendbuffsize=, 
sendmsgid=259, exprecvmsgid=2307, recvsmb=0x7f9c1b918e70, 
errorbuf=, errorbufsize=1024) + 73 at rmcomm_QD2RM.c:2780
frame #4: 0x000101db2d3c 
postgres`acquireResourceFromRM(index=, sessionid=12, 
slice_size=462524016, iobytes=134217728, preferred_nodes=0x7f9c1a02d398, 
preferred_nodes_size=, max_seg_count_fix=, 
min_seg_count_fix=, errorbuf=, 
errorbufsize=) + 572 at rmcomm_QD2RM.c:742
frame #5: 0x000101c979e7 postgres`AllocateResource(life=QRL_ONCE, 
slice_size=5, iobytes=134217728, max_target_segment_num=1, 
min_target_segment_num=1, vol_info=0x7f9c1a02d398, vol_info_size=1) + 631 
at pquery.c:796
frame #6: 0x000101e8c60f 
postgres`calculate_planner_segment_num(query=, 
resourceLife=QRL_ONCE, fullRangeTable=, intoPolicy=, 
sliceNum=5) + 14287 at cdbdatalocality.c:4207
frame #7: 0x000101c0f671 postgres`planner + 106 at planner.c:496
frame #8: 0x000101c0f607 postgres`planner(parse=0x7f9c1a02a140, 
cursorOptions=, boundParams=0x, 
resourceLife=QRL_ONCE) + 311 at planner.c:310
frame #9: 0x000101c8eb33 
postgres`pg_plan_query(querytree=0x7f9c1a02a140, 
boundParams=0x, resource_life=QRL_ONCE) + 99 at postgres.c:837
frame #10: 0x000101c956ae postgres`exec_simple_query + 21 at 
postgres.c:911
frame #11: 0x000101c95699 
postgres`exec_simple_query(query_string=0x7f9c1a028a30, 
seqServerHost=0x, seqServerPort=-1) + 1577 at postgres.c:1671
frame #12: 0x000101c91a4c postgres`PostgresMain(argc=, 
argv=, username=0x7f9c1b808cf0) + 9404 at postgres.c:4754
frame #13: 0x000101c4ae02 postgres`ServerLoop [inlined] BackendRun + 
105 at postmaster.c:5889
frame #14: 0x000101c4ad99 postgres`ServerLoop at postmaster.c:5484
frame #15: 0x000101c4ad99 postgres`ServerLoop + 9593 at 
postmaster.c:2163
frame #16: 0x000101c47d3b postgres`PostmasterMain(argc=, 
argv=) + 5019 at postmaster.c:1454
frame #17: 0x000101bb1aa9 postgres`main(argc=9, 
argv=0x7f9c19c1eef0) + 1433 at main.c:209
frame #18: 0x7fff95e8c5c9 libdyld.dylib`start + 1

  thread #2: tid = 0x21d8bf, 0x7fff890355be libsystem_kernel.dylib`poll + 10
frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10
frame #1: 0x000101dfe723 postgres`rxThreadFunc(arg=) + 
2163 at ic_udp.c:6251
frame #2: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
frame #3: 0x7fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
frame #4: 0x7fff95e804b1 libsystem_pthread.dylib`thread_start + 13

  thread #3: tid = 0x21d9c2, 0x7fff890343f6 libsystem_kernel.dylib`__select 
+ 10
frame #0: 0x7fff890343f6 libsystem_kernel.dylib`__select + 10
frame #1: 0x000101e9d42e postgres`pg_usleep(microsec=) + 
78 at pgsleep.c:43
frame #2: 0x000101db1a66 
postgres`generateResourceRefreshHeartBeat(arg=0x7f9c19f02480) + 166 at 
rmcomm_QD2RM.c:1519
frame #3: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
frame #4: 0x7fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
frame #5: 0x7fff95e804b1 libsystem_pthread.dylib`thread_start + 13
{code}
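
In the backtrace above, thread #1 is blocked in poll() beneath callSyncRPCRemote while waiting for the resource manager's reply. The trace does not show whether that poll() call is given a timeout. As a generic illustration only, and not HAWQ code, a bounded wait on a descriptor can be sketched in C as follows. The helper name read_with_timeout and its return codes are hypothetical.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of HAWQ: wait up to timeout_ms for fd to
 * become readable, then read up to len bytes.
 * Returns bytes read (> 0), 0 on EOF, -1 on error, -2 on timeout. */
ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* Retry if a signal interrupts poll(); for brevity the full
     * timeout is reused on each retry. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;          /* poll() itself failed */
    if (rc == 0)
        return -2;          /* peer stayed silent for timeout_ms */

    /* POLLIN, POLLHUP or POLLERR: read() reports data, EOF, or the error. */
    return read(fd, buf, len);
}
```

With a finite timeout_ms the caller gets control back when the peer never answers and can report an error or reconnect, whereas a negative timeout makes poll() wait indefinitely.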

And here are the operations:
1. Before injection, the query returns the correct answer.
{code}
dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, 
test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
 count
---
  3725
(1 row)
{code}
2. Inject the panic; the fault is triggered and the segment goes down.
{code}
dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, 
test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
ERROR:  fault triggered, fault 

[jira] [Updated] (HAWQ-564) QD hangs when connecting to resource manager

2016-03-20 Thread Chunling Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chunling Wang updated HAWQ-564:
---
Affects Version/s: 2.0.0

> QD hangs when connecting to resource manager
> 
>
> Key: HAWQ-564
> URL: https://issues.apache.org/jira/browse/HAWQ-564
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Resource Manager
>Affects Versions: 2.0.0
>Reporter: Chunling Wang
>Assignee: Lei Chang
>
> When first inject panic in QE process, we run a query and segment is down. 
> After the segment is up, we run another query and get correct answer. Then we 
> inject the same panic second time. After the segment is down and then up 
> again, we run a query and find QD process hangs when connecting to resource 
> manager. Here is the backtrace when QD hangs:
> {code}
> * thread #1: tid = 0x21d8be, 0x7fff890355be libsystem_kernel.dylib`poll + 
> 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
>   * frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10
> frame #1: 0x000101daeafe postgres`processAllCommFileDescs + 158 at 
> rmcomm_AsyncComm.c:156
> frame #2: 0x000101db85f5 
> postgres`callSyncRPCRemote(hostname=0x7f9c19e00cd0, port=5437, 
> sendbuff=0x7f9c1b918f50, sendbuffsize=80, sendmsgid=259, 
> exprecvmsgid=2307, recvsmb=, errorbuf=0x00010230c1a0, 
> errorbufsize=) + 645 at rmcomm_SyncComm.c:122
> frame #3: 0x000101db2d85 postgres`acquireResourceFromRM [inlined] 
> callSyncRPCToRM(sendbuff=0x7f9c1b918f50, sendbuffsize=, 
> sendmsgid=259, exprecvmsgid=2307, recvsmb=0x7f9c1b918e70, 
> errorbuf=, errorbufsize=1024) + 73 at rmcomm_QD2RM.c:2780
> frame #4: 0x000101db2d3c 
> postgres`acquireResourceFromRM(index=, sessionid=12, 
> slice_size=462524016, iobytes=134217728, preferred_nodes=0x7f9c1a02d398, 
> preferred_nodes_size=, max_seg_count_fix=, 
> min_seg_count_fix=, errorbuf=, 
> errorbufsize=) + 572 at rmcomm_QD2RM.c:742
> frame #5: 0x000101c979e7 postgres`AllocateResource(life=QRL_ONCE, 
> slice_size=5, iobytes=134217728, max_target_segment_num=1, 
> min_target_segment_num=1, vol_info=0x7f9c1a02d398, vol_info_size=1) + 631 
> at pquery.c:796
> frame #6: 0x000101e8c60f 
> postgres`calculate_planner_segment_num(query=, 
> resourceLife=QRL_ONCE, fullRangeTable=, 
> intoPolicy=, sliceNum=5) + 14287 at cdbdatalocality.c:4207
> frame #7: 0x000101c0f671 postgres`planner + 106 at planner.c:496
> frame #8: 0x000101c0f607 postgres`planner(parse=0x7f9c1a02a140, 
> cursorOptions=, boundParams=0x, 
> resourceLife=QRL_ONCE) + 311 at planner.c:310
> frame #9: 0x000101c8eb33 
> postgres`pg_plan_query(querytree=0x7f9c1a02a140, 
> boundParams=0x, resource_life=QRL_ONCE) + 99 at postgres.c:837
> frame #10: 0x000101c956ae postgres`exec_simple_query + 21 at 
> postgres.c:911
> frame #11: 0x000101c95699 
> postgres`exec_simple_query(query_string=0x7f9c1a028a30, 
> seqServerHost=0x, seqServerPort=-1) + 1577 at postgres.c:1671
> frame #12: 0x000101c91a4c postgres`PostgresMain(argc=, 
> argv=, username=0x7f9c1b808cf0) + 9404 at postgres.c:4754
> frame #13: 0x000101c4ae02 postgres`ServerLoop [inlined] BackendRun + 
> 105 at postmaster.c:5889
> frame #14: 0x000101c4ad99 postgres`ServerLoop at postmaster.c:5484
> frame #15: 0x000101c4ad99 postgres`ServerLoop + 9593 at 
> postmaster.c:2163
> frame #16: 0x000101c47d3b postgres`PostmasterMain(argc=, 
> argv=) + 5019 at postmaster.c:1454
> frame #17: 0x000101bb1aa9 postgres`main(argc=9, 
> argv=0x7f9c19c1eef0) + 1433 at main.c:209
> frame #18: 0x7fff95e8c5c9 libdyld.dylib`start + 1
>   thread #2: tid = 0x21d8bf, 0x7fff890355be libsystem_kernel.dylib`poll + 
> 10
> frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10
> frame #1: 0x000101dfe723 postgres`rxThreadFunc(arg=) + 
> 2163 at ic_udp.c:6251
> frame #2: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
> frame #3: 0x7fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
> frame #4: 0x7fff95e804b1 libsystem_pthread.dylib`thread_start + 13
>   thread #3: tid = 0x21d9c2, 0x7fff890343f6 
> libsystem_kernel.dylib`__select + 10
> frame #0: 0x7fff890343f6 libsystem_kernel.dylib`__select + 10
> frame #1: 0x000101e9d42e postgres`pg_usleep(microsec=) + 
> 78 at pgsleep.c:43
> frame #2: 0x000101db1a66 
> postgres`generateResourceRefreshHeartBeat(arg=0x7f9c19f02480) + 166 at 
> rmcomm_QD2RM.c:1519
> frame #3: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
> frame #4: 0x7fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
> frame #5: 

[jira] [Updated] (HAWQ-564) QD hangs when connecting to resource manager

2016-03-20 Thread Chunling Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chunling Wang updated HAWQ-564:
---
Description: 
We first inject a panic into a QE process and run a query, which brings the 
segment down. After the segment comes back up, we run another query and get the 
correct answer. Then we inject the same panic a second time. After the segment 
goes down and comes back up again, we run a query and find that the QD process 
hangs when connecting to the resource manager. 
Here is the backtrace when the QD hangs:
{code}
* thread #1: tid = 0x21d8be, 0x7fff890355be libsystem_kernel.dylib`poll + 
10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10
frame #1: 0x000101daeafe postgres`processAllCommFileDescs + 158 at 
rmcomm_AsyncComm.c:156
frame #2: 0x000101db85f5 
postgres`callSyncRPCRemote(hostname=0x7f9c19e00cd0, port=5437, 
sendbuff=0x7f9c1b918f50, sendbuffsize=80, sendmsgid=259, exprecvmsgid=2307, 
recvsmb=, errorbuf=0x00010230c1a0, errorbufsize=) 
+ 645 at rmcomm_SyncComm.c:122
frame #3: 0x000101db2d85 postgres`acquireResourceFromRM [inlined] 
callSyncRPCToRM(sendbuff=0x7f9c1b918f50, sendbuffsize=, 
sendmsgid=259, exprecvmsgid=2307, recvsmb=0x7f9c1b918e70, 
errorbuf=, errorbufsize=1024) + 73 at rmcomm_QD2RM.c:2780
frame #4: 0x000101db2d3c 
postgres`acquireResourceFromRM(index=, sessionid=12, 
slice_size=462524016, iobytes=134217728, preferred_nodes=0x7f9c1a02d398, 
preferred_nodes_size=, max_seg_count_fix=, 
min_seg_count_fix=, errorbuf=, 
errorbufsize=) + 572 at rmcomm_QD2RM.c:742
frame #5: 0x000101c979e7 postgres`AllocateResource(life=QRL_ONCE, 
slice_size=5, iobytes=134217728, max_target_segment_num=1, 
min_target_segment_num=1, vol_info=0x7f9c1a02d398, vol_info_size=1) + 631 
at pquery.c:796
frame #6: 0x000101e8c60f 
postgres`calculate_planner_segment_num(query=, 
resourceLife=QRL_ONCE, fullRangeTable=, intoPolicy=, 
sliceNum=5) + 14287 at cdbdatalocality.c:4207
frame #7: 0x000101c0f671 postgres`planner + 106 at planner.c:496
frame #8: 0x000101c0f607 postgres`planner(parse=0x7f9c1a02a140, 
cursorOptions=, boundParams=0x, 
resourceLife=QRL_ONCE) + 311 at planner.c:310
frame #9: 0x000101c8eb33 
postgres`pg_plan_query(querytree=0x7f9c1a02a140, 
boundParams=0x, resource_life=QRL_ONCE) + 99 at postgres.c:837
frame #10: 0x000101c956ae postgres`exec_simple_query + 21 at 
postgres.c:911
frame #11: 0x000101c95699 
postgres`exec_simple_query(query_string=0x7f9c1a028a30, 
seqServerHost=0x, seqServerPort=-1) + 1577 at postgres.c:1671
frame #12: 0x000101c91a4c postgres`PostgresMain(argc=, 
argv=, username=0x7f9c1b808cf0) + 9404 at postgres.c:4754
frame #13: 0x000101c4ae02 postgres`ServerLoop [inlined] BackendRun + 
105 at postmaster.c:5889
frame #14: 0x000101c4ad99 postgres`ServerLoop at postmaster.c:5484
frame #15: 0x000101c4ad99 postgres`ServerLoop + 9593 at 
postmaster.c:2163
frame #16: 0x000101c47d3b postgres`PostmasterMain(argc=, 
argv=) + 5019 at postmaster.c:1454
frame #17: 0x000101bb1aa9 postgres`main(argc=9, 
argv=0x7f9c19c1eef0) + 1433 at main.c:209
frame #18: 0x7fff95e8c5c9 libdyld.dylib`start + 1

  thread #2: tid = 0x21d8bf, 0x7fff890355be libsystem_kernel.dylib`poll + 10
frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10
frame #1: 0x000101dfe723 postgres`rxThreadFunc(arg=) + 
2163 at ic_udp.c:6251
frame #2: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
frame #3: 0x7fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
frame #4: 0x7fff95e804b1 libsystem_pthread.dylib`thread_start + 13

  thread #3: tid = 0x21d9c2, 0x7fff890343f6 libsystem_kernel.dylib`__select 
+ 10
frame #0: 0x7fff890343f6 libsystem_kernel.dylib`__select + 10
frame #1: 0x000101e9d42e postgres`pg_usleep(microsec=) + 
78 at pgsleep.c:43
frame #2: 0x000101db1a66 
postgres`generateResourceRefreshHeartBeat(arg=0x7f9c19f02480) + 166 at 
rmcomm_QD2RM.c:1519
frame #3: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
frame #4: 0x7fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
frame #5: 0x7fff95e804b1 libsystem_pthread.dylib`thread_start + 13
{code}

And here are the operations:
{code}
dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, 
test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
 count
---
  3725
(1 row)

dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, 
test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
ERROR:  fault triggered, fault name:'fail_qe_when_do_query' fault type:'panic' 
(faultinjector.c:656)  (seg0 localhost:4 pid=26936)
dispatch=# select count(*)