[jira] [Commented] (HAWQ-564) QD hangs when connecting to resource manager

2016-03-22 Thread Chunling Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206095#comment-15206095
 ] 

Chunling Wang commented on HAWQ-564:


'kill -6' can also cause the same result.

> QD hangs when connecting to resource manager
> 
>
> Key: HAWQ-564
> URL: https://issues.apache.org/jira/browse/HAWQ-564
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Resource Manager
>Affects Versions: 2.0.0
>Reporter: Chunling Wang
>Assignee: Lei Chang
>
> When we first inject a panic into the QE process and run a query, the segment
> goes down. After the segment is back up, we run another query and get the
> correct answer. Then we inject the same panic a second time. After the segment
> goes down and comes back up again, we run a query and find that the QD process
> hangs when connecting to the resource manager. Here is the backtrace when the
> QD hangs:
> {code}
> * thread #1: tid = 0x21d8be, 0x7fff890355be libsystem_kernel.dylib`poll + 
> 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
>   * frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10
> frame #1: 0x000101daeafe postgres`processAllCommFileDescs + 158 at 
> rmcomm_AsyncComm.c:156
> frame #2: 0x000101db85f5 
> postgres`callSyncRPCRemote(hostname=0x7f9c19e00cd0, port=5437, 
> sendbuff=0x7f9c1b918f50, sendbuffsize=80, sendmsgid=259, 
> exprecvmsgid=2307, recvsmb=, errorbuf=0x00010230c1a0, 
> errorbufsize=) + 645 at rmcomm_SyncComm.c:122
> frame #3: 0x000101db2d85 postgres`acquireResourceFromRM [inlined] 
> callSyncRPCToRM(sendbuff=0x7f9c1b918f50, sendbuffsize=, 
> sendmsgid=259, exprecvmsgid=2307, recvsmb=0x7f9c1b918e70, 
> errorbuf=, errorbufsize=1024) + 73 at rmcomm_QD2RM.c:2780
> frame #4: 0x000101db2d3c 
> postgres`acquireResourceFromRM(index=, sessionid=12, 
> slice_size=462524016, iobytes=134217728, preferred_nodes=0x7f9c1a02d398, 
> preferred_nodes_size=, max_seg_count_fix=, 
> min_seg_count_fix=, errorbuf=, 
> errorbufsize=) + 572 at rmcomm_QD2RM.c:742
> frame #5: 0x000101c979e7 postgres`AllocateResource(life=QRL_ONCE, 
> slice_size=5, iobytes=134217728, max_target_segment_num=1, 
> min_target_segment_num=1, vol_info=0x7f9c1a02d398, vol_info_size=1) + 631 
> at pquery.c:796
> frame #6: 0x000101e8c60f 
> postgres`calculate_planner_segment_num(query=, 
> resourceLife=QRL_ONCE, fullRangeTable=, 
> intoPolicy=, sliceNum=5) + 14287 at cdbdatalocality.c:4207
> frame #7: 0x000101c0f671 postgres`planner + 106 at planner.c:496
> frame #8: 0x000101c0f607 postgres`planner(parse=0x7f9c1a02a140, 
> cursorOptions=, boundParams=0x, 
> resourceLife=QRL_ONCE) + 311 at planner.c:310
> frame #9: 0x000101c8eb33 
> postgres`pg_plan_query(querytree=0x7f9c1a02a140, 
> boundParams=0x, resource_life=QRL_ONCE) + 99 at postgres.c:837
> frame #10: 0x000101c956ae postgres`exec_simple_query + 21 at 
> postgres.c:911
> frame #11: 0x000101c95699 
> postgres`exec_simple_query(query_string=0x7f9c1a028a30, 
> seqServerHost=0x, seqServerPort=-1) + 1577 at postgres.c:1671
> frame #12: 0x000101c91a4c postgres`PostgresMain(argc=, 
> argv=, username=0x7f9c1b808cf0) + 9404 at postgres.c:4754
> frame #13: 0x000101c4ae02 postgres`ServerLoop [inlined] BackendRun + 
> 105 at postmaster.c:5889
> frame #14: 0x000101c4ad99 postgres`ServerLoop at postmaster.c:5484
> frame #15: 0x000101c4ad99 postgres`ServerLoop + 9593 at 
> postmaster.c:2163
> frame #16: 0x000101c47d3b postgres`PostmasterMain(argc=, 
> argv=) + 5019 at postmaster.c:1454
> frame #17: 0x000101bb1aa9 postgres`main(argc=9, 
> argv=0x7f9c19c1eef0) + 1433 at main.c:209
> frame #18: 0x7fff95e8c5c9 libdyld.dylib`start + 1
>   thread #2: tid = 0x21d8bf, 0x7fff890355be libsystem_kernel.dylib`poll + 
> 10
> frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10
> frame #1: 0x000101dfe723 postgres`rxThreadFunc(arg=) + 
> 2163 at ic_udp.c:6251
> frame #2: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
> frame #3: 0x7fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
> frame #4: 0x7fff95e804b1 libsystem_pthread.dylib`thread_start + 13
>   thread #3: tid = 0x21d9c2, 0x7fff890343f6 
> libsystem_kernel.dylib`__select + 10
> frame #0: 0x7fff890343f6 libsystem_kernel.dylib`__select + 10
> frame #1: 0x000101e9d42e postgres`pg_usleep(microsec=) + 
> 78 at pgsleep.c:43
> frame #2: 0x000101db1a66 
> postgres`generateResourceRefreshHeartBeat(arg=0x7f9c19f02480) + 166 at 
> rmcomm_QD2RM.c:1519
> frame #3: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
> frame #4: 0x7fff95e82279 lib

[jira] [Created] (HAWQ-572) Improve code coverage for dispatcher: fail_qe_after_connection & fail_qe_when_do_query & fail_qe_when_begin_parquet_scan

2016-03-22 Thread Chunling Wang (JIRA)
Chunling Wang created HAWQ-572:
--

 Summary: Improve code coverage for dispatcher: 
fail_qe_after_connection & fail_qe_when_do_query & 
fail_qe_when_begin_parquet_scan
 Key: HAWQ-572
 URL: https://issues.apache.org/jira/browse/HAWQ-572
 Project: Apache HAWQ
  Issue Type: Sub-task
  Components: Dispatcher
Reporter: Chunling Wang
Assignee: Lei Chang


Add the following fault injections:
1. fail_qe_after_connection
2. fail_qe_when_do_query
3. fail_qe_when_begin_parquet_scan
Also add test cases for them.
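
For readers unfamiliar with the technique: a fault injection point is a named hook in the code that does nothing unless a test has armed it. The following is only a minimal sketch in Python, not HAWQ's implementation. FaultInjector, arm, disarm and check are hypothetical names; only the three fault names come from this issue.

```python
class FaultInjector:
    """Registry of named fault points that a test can arm (hypothetical)."""

    def __init__(self):
        self._armed = set()

    def arm(self, name):
        """Make the fault point called `name` fail the next time it is hit."""
        self._armed.add(name)

    def disarm(self, name):
        """Return the fault point called `name` to its normal no-op state."""
        self._armed.discard(name)

    def check(self, name):
        """Call this at the fault point; it raises only if the point is armed."""
        if name in self._armed:
            raise RuntimeError(f"fault triggered: {name}")


if __name__ == "__main__":
    injector = FaultInjector()
    injector.check("fail_qe_after_connection")   # not armed, nothing happens
    injector.arm("fail_qe_when_do_query")
    try:
        injector.check("fail_qe_when_do_query")  # armed, so this raises
    except RuntimeError as exc:
        print(exc)
```

A test case would arm one fault, run a query, and then assert that the dispatcher reports the failure and recovers.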



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HAWQ-568) After query finished, kill a QE but can still recv() data from this QE socket

2016-03-21 Thread Chunling Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chunling Wang updated HAWQ-568:
---
Summary: After query finished, kill a QE but can still recv() data from 
this QE socket  (was: After query finished, kill a QE but can still recv() from 
this QE socket)

> After query finished, kill a QE but can still recv() data from this QE socket
> -
>
> Key: HAWQ-568
> URL: https://issues.apache.org/jira/browse/HAWQ-568
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Dispatcher
>Affects Versions: 2.0.0
>Reporter: Chunling Wang
>Assignee: Lei Chang
>
> After a query finishes, we kill one QE while the other QEs remain in the QE
> pool. When we check whether the connection to the killed QE is still alive by
> calling recv() on its socket, we can still receive data.
> 1. Run a query so that some QEs remain.
> {code}
> dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, 
> test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
>  count
> ---
>   3725
> (1 row)
> {code}
> {code}
> $ ps -ef|grep postgres
>   501 55701 1   0  5:38下午 ?? 0:00.38 /usr/local/hawq/bin/postgres 
> -D /Users/wangchunling/hawq-data-directory/masterdd -i -M master -p 5432 
> --silent-mode=true
>   501 55702 55701   0  5:38下午 ?? 0:00.01 postgres: port  5432, master 
> logger process
>   501 55705 55701   0  5:38下午 ?? 0:00.00 postgres: port  5432, stats 
> collector process
>   501 55706 55701   0  5:38下午 ?? 0:00.04 postgres: port  5432, writer 
> process
>   501 55707 55701   0  5:38下午 ?? 0:00.01 postgres: port  5432, 
> checkpoint process
>   501 55708 55701   0  5:38下午 ?? 0:00.00 postgres: port  5432, 
> seqserver process
>   501 55709 55701   0  5:38下午 ?? 0:00.01 postgres: port  5432, WAL 
> Send Server process
>   501 55710 55701   0  5:38下午 ?? 0:00.00 postgres: port  5432, DFS 
> Metadata Cache process
>   501 55711 55701   0  5:38下午 ?? 0:00.26 postgres: port  5432, master 
> resource manager
>   501 55727 1   0  5:38下午 ?? 0:00.52 /usr/local/hawq/bin/postgres 
> -D /Users/wangchunling/hawq-data-directory/segmentdd -i -M segment -p 4 
> --silent-mode=true
>   501 55728 55727   0  5:38下午 ?? 0:00.06 postgres: port 4, logger 
> process
>   501 55731 55727   0  5:38下午 ?? 0:00.00 postgres: port 4, stats 
> collector process
>   501 55732 55727   0  5:38下午 ?? 0:00.04 postgres: port 4, writer 
> process
>   501 55733 55727   0  5:38下午 ?? 0:00.01 postgres: port 4, 
> checkpoint process
>   501 55734 55727   0  5:38下午 ?? 0:00.09 postgres: port 4, 
> segment resource manager
>   501 55741 55748   0  5:38下午 ?? 0:00.05 postgres: port  5432, 
> wangchunling dispatch [local] con12 cmd6 idle [local]
>   501 55743 55727   0  5:38下午 ?? 0:00.36 postgres: port 4, 
> wangchunling dispatch 127.0.0.1(50800) con12 seg0 idle
>   501 55770 55727   0  5:43下午 ?? 0:00.12 postgres: port 4, 
> wangchunling dispatch 127.0.0.1(50853) con12 seg0 idle
>   501 55771 55727   0  5:44下午 ?? 0:00.11 postgres: port 4, 
> wangchunling dispatch 127.0.0.1(50855) con12 seg0 idle
>   501 55774 26980   0  5:44下午 ttys0080:00.00 grep postgres
> {code}
> 2. Kill one QE.
> {code}
> $ kill 55771
> $ ps -ef|grep postgres
>   501 55701 1   0  5:38下午 ?? 0:00.38 /usr/local/hawq/bin/postgres 
> -D /Users/wangchunling/hawq-data-directory/masterdd -i -M master -p 5432 
> --silent-mode=true
>   501 55702 55701   0  5:38下午 ?? 0:00.01 postgres: port  5432, master 
> logger process
>   501 55705 55701   0  5:38下午 ?? 0:00.00 postgres: port  5432, stats 
> collector process
>   501 55706 55701   0  5:38下午 ?? 0:00.04 postgres: port  5432, writer 
> process
>   501 55707 55701   0  5:38下午 ?? 0:00.01 postgres: port  5432, 
> checkpoint process
>   501 55708 55701   0  5:38下午 ?? 0:00.00 postgres: port  5432, 
> seqserver process
>   501 55709 55701   0  5:38下午 ?? 0:00.01 postgres: port  5432, WAL 
> Send Server process
>   501 55710 55701   0  5:38下午 ?? 0:00.00 postgres: port  5432, DFS 
> Metadata Cache process
>   501 55711 55701   0  5:38下午 ?? 0:00.27 postgres: port  5432, master 
> resource manager
>   501 55727 1   0  5:38下午 ?? 0:00.52 /usr/local/hawq/bin/postgres 
> -D /Users/wangchunling/hawq-data-directory/segmentdd -i -M segment -p 4 
> --silent-mode=true
>   501 55728 55727   0  5:38下午 ?? 0:00.06 postgres: port 4, logger 
> process
>   501 55731 55727   0  5:38下午 ?? 0:00.00 postgres: port 4, stats 
> collector process
>   501 55732 55727   0  5:38下午 ?? 0:00.04 postgres: port 4, writer 
> process
>   501 

[jira] [Created] (HAWQ-568) After query finished, kill a QE but can still recv() from this QE socket

2016-03-21 Thread Chunling Wang (JIRA)
Chunling Wang created HAWQ-568:
--

 Summary: After query finished, kill a QE but can still recv() from 
this QE socket
 Key: HAWQ-568
 URL: https://issues.apache.org/jira/browse/HAWQ-568
 Project: Apache HAWQ
  Issue Type: Bug
  Components: Dispatcher
Reporter: Chunling Wang
Assignee: Lei Chang


After a query finishes, we kill one QE while the other QEs remain in the QE
pool. When we check whether the connection to the killed QE is still alive by
calling recv() on its socket, we can still receive data.
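
As background on why a recv()-based liveness check can return data after the peer has exited: bytes the peer wrote before exiting stay readable, and end-of-file is reported only once they are drained. The sketch below illustrates that general socket behaviour in Python. It is not HAWQ's code, probe_peer is a hypothetical helper, and whether this explains the behaviour reported here is not confirmed.

```python
import socket


def probe_peer(sock):
    """Classify a connected stream socket without consuming any bytes."""
    try:
        # MSG_PEEK leaves the data queued; MSG_DONTWAIT keeps the call from blocking.
        data = sock.recv(1, socket.MSG_PEEK | socket.MSG_DONTWAIT)
    except BlockingIOError:
        return "open-idle"      # nothing to read and no end-of-file seen
    except ConnectionError:
        return "closed"         # connection reset or otherwise broken
    # A zero-length read on a stream socket means the peer has closed.
    return "data-pending" if data else "closed"


if __name__ == "__main__":
    qd, qe = socket.socketpair()   # stand-ins for the QD and QE ends
    print(probe_peer(qd))          # peer alive, nothing sent yet
    qe.sendall(b"bye")             # peer writes a last message ...
    qe.close()                     # ... and then goes away
    print(probe_peer(qd))          # unread bytes still hide the close
    qd.recv(16)                    # drain the pending bytes
    print(probe_peer(qd))          # only now is end-of-file visible
    qd.close()
```

On a Unix-like system the three probes should report open-idle, data-pending and closed, in that order. A check that treats "recv() returned data" as "peer alive" would therefore misclassify the second state.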
1. Run a query so that some QEs remain.
{code}
dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, 
test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
 count
---
  3725
(1 row)
{code}
{code}
$ ps -ef|grep postgres
  501 55701 1   0  5:38下午 ?? 0:00.38 /usr/local/hawq/bin/postgres 
-D /Users/wangchunling/hawq-data-directory/masterdd -i -M master -p 5432 
--silent-mode=true
  501 55702 55701   0  5:38下午 ?? 0:00.01 postgres: port  5432, master 
logger process
  501 55705 55701   0  5:38下午 ?? 0:00.00 postgres: port  5432, stats 
collector process
  501 55706 55701   0  5:38下午 ?? 0:00.04 postgres: port  5432, writer 
process
  501 55707 55701   0  5:38下午 ?? 0:00.01 postgres: port  5432, 
checkpoint process
  501 55708 55701   0  5:38下午 ?? 0:00.00 postgres: port  5432, 
seqserver process
  501 55709 55701   0  5:38下午 ?? 0:00.01 postgres: port  5432, WAL Send 
Server process
  501 55710 55701   0  5:38下午 ?? 0:00.00 postgres: port  5432, DFS 
Metadata Cache process
  501 55711 55701   0  5:38下午 ?? 0:00.26 postgres: port  5432, master 
resource manager
  501 55727 1   0  5:38下午 ?? 0:00.52 /usr/local/hawq/bin/postgres 
-D /Users/wangchunling/hawq-data-directory/segmentdd -i -M segment -p 4 
--silent-mode=true
  501 55728 55727   0  5:38下午 ?? 0:00.06 postgres: port 4, logger 
process
  501 55731 55727   0  5:38下午 ?? 0:00.00 postgres: port 4, stats 
collector process
  501 55732 55727   0  5:38下午 ?? 0:00.04 postgres: port 4, writer 
process
  501 55733 55727   0  5:38下午 ?? 0:00.01 postgres: port 4, 
checkpoint process
  501 55734 55727   0  5:38下午 ?? 0:00.09 postgres: port 4, segment 
resource manager
  501 55741 55748   0  5:38下午 ?? 0:00.05 postgres: port  5432, 
wangchunling dispatch [local] con12 cmd6 idle [local]
  501 55743 55727   0  5:38下午 ?? 0:00.36 postgres: port 4, 
wangchunling dispatch 127.0.0.1(50800) con12 seg0 idle
  501 55770 55727   0  5:43下午 ?? 0:00.12 postgres: port 4, 
wangchunling dispatch 127.0.0.1(50853) con12 seg0 idle
  501 55771 55727   0  5:44下午 ?? 0:00.11 postgres: port 4, 
wangchunling dispatch 127.0.0.1(50855) con12 seg0 idle
  501 55774 26980   0  5:44下午 ttys0080:00.00 grep postgres
{code}
2. Kill one QE.
{code}
$ kill 55771
$ ps -ef|grep postgres
  501 55701 1   0  5:38下午 ?? 0:00.38 /usr/local/hawq/bin/postgres 
-D /Users/wangchunling/hawq-data-directory/masterdd -i -M master -p 5432 
--silent-mode=true
  501 55702 55701   0  5:38下午 ?? 0:00.01 postgres: port  5432, master 
logger process
  501 55705 55701   0  5:38下午 ?? 0:00.00 postgres: port  5432, stats 
collector process
  501 55706 55701   0  5:38下午 ?? 0:00.04 postgres: port  5432, writer 
process
  501 55707 55701   0  5:38下午 ?? 0:00.01 postgres: port  5432, 
checkpoint process
  501 55708 55701   0  5:38下午 ?? 0:00.00 postgres: port  5432, 
seqserver process
  501 55709 55701   0  5:38下午 ?? 0:00.01 postgres: port  5432, WAL Send 
Server process
  501 55710 55701   0  5:38下午 ?? 0:00.00 postgres: port  5432, DFS 
Metadata Cache process
  501 55711 55701   0  5:38下午 ?? 0:00.27 postgres: port  5432, master 
resource manager
  501 55727 1   0  5:38下午 ?? 0:00.52 /usr/local/hawq/bin/postgres 
-D /Users/wangchunling/hawq-data-directory/segmentdd -i -M segment -p 4 
--silent-mode=true
  501 55728 55727   0  5:38下午 ?? 0:00.06 postgres: port 4, logger 
process
  501 55731 55727   0  5:38下午 ?? 0:00.00 postgres: port 4, stats 
collector process
  501 55732 55727   0  5:38下午 ?? 0:00.04 postgres: port 4, writer 
process
  501 55733 55727   0  5:38下午 ?? 0:00.01 postgres: port 4, 
checkpoint process
  501 55734 55727   0  5:38下午 ?? 0:00.09 postgres: port 4, segment 
resource manager
  501 55741 55748   0  5:38下午 ?? 0:00.05 postgres: port  5432, 
wangchunling dispatch [local] con12 cmd6 idle [local]
  501 55743 55727   0  5:38下午 ?? 0:00.36 postgres: port 4, 
wangchunling dispatch 127.0.0.1(50800) con12 seg0 idle
  501 55770 55727   0  5:43下午 ?? 0:00.12 postgres: port 4, 
wangchunling dispatch 127.0.0.1(50853) con12 seg0 idle
  501 55776 269

[jira] [Updated] (HAWQ-568) After query finished, kill a QE but can still recv() from this QE socket

2016-03-21 Thread Chunling Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chunling Wang updated HAWQ-568:
---
Affects Version/s: 2.0.0

> After query finished, kill a QE but can still recv() from this QE socket
> 
>
> Key: HAWQ-568
> URL: https://issues.apache.org/jira/browse/HAWQ-568
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Dispatcher
>Affects Versions: 2.0.0
>Reporter: Chunling Wang
>Assignee: Lei Chang
>
> After a query finishes, we kill one QE while the other QEs remain in the QE
> pool. When we check whether the connection to the killed QE is still alive by
> calling recv() on its socket, we can still receive data.
> 1. Run a query so that some QEs remain.
> {code}
> dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, 
> test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
>  count
> ---
>   3725
> (1 row)
> {code}
> {code}
> $ ps -ef|grep postgres
>   501 55701 1   0  5:38下午 ?? 0:00.38 /usr/local/hawq/bin/postgres 
> -D /Users/wangchunling/hawq-data-directory/masterdd -i -M master -p 5432 
> --silent-mode=true
>   501 55702 55701   0  5:38下午 ?? 0:00.01 postgres: port  5432, master 
> logger process
>   501 55705 55701   0  5:38下午 ?? 0:00.00 postgres: port  5432, stats 
> collector process
>   501 55706 55701   0  5:38下午 ?? 0:00.04 postgres: port  5432, writer 
> process
>   501 55707 55701   0  5:38下午 ?? 0:00.01 postgres: port  5432, 
> checkpoint process
>   501 55708 55701   0  5:38下午 ?? 0:00.00 postgres: port  5432, 
> seqserver process
>   501 55709 55701   0  5:38下午 ?? 0:00.01 postgres: port  5432, WAL 
> Send Server process
>   501 55710 55701   0  5:38下午 ?? 0:00.00 postgres: port  5432, DFS 
> Metadata Cache process
>   501 55711 55701   0  5:38下午 ?? 0:00.26 postgres: port  5432, master 
> resource manager
>   501 55727 1   0  5:38下午 ?? 0:00.52 /usr/local/hawq/bin/postgres 
> -D /Users/wangchunling/hawq-data-directory/segmentdd -i -M segment -p 4 
> --silent-mode=true
>   501 55728 55727   0  5:38下午 ?? 0:00.06 postgres: port 4, logger 
> process
>   501 55731 55727   0  5:38下午 ?? 0:00.00 postgres: port 4, stats 
> collector process
>   501 55732 55727   0  5:38下午 ?? 0:00.04 postgres: port 4, writer 
> process
>   501 55733 55727   0  5:38下午 ?? 0:00.01 postgres: port 4, 
> checkpoint process
>   501 55734 55727   0  5:38下午 ?? 0:00.09 postgres: port 4, 
> segment resource manager
>   501 55741 55748   0  5:38下午 ?? 0:00.05 postgres: port  5432, 
> wangchunling dispatch [local] con12 cmd6 idle [local]
>   501 55743 55727   0  5:38下午 ?? 0:00.36 postgres: port 4, 
> wangchunling dispatch 127.0.0.1(50800) con12 seg0 idle
>   501 55770 55727   0  5:43下午 ?? 0:00.12 postgres: port 4, 
> wangchunling dispatch 127.0.0.1(50853) con12 seg0 idle
>   501 55771 55727   0  5:44下午 ?? 0:00.11 postgres: port 4, 
> wangchunling dispatch 127.0.0.1(50855) con12 seg0 idle
>   501 55774 26980   0  5:44下午 ttys0080:00.00 grep postgres
> {code}
> 2. Kill one QE.
> {code}
> $ kill 55771
> $ ps -ef|grep postgres
>   501 55701 1   0  5:38下午 ?? 0:00.38 /usr/local/hawq/bin/postgres 
> -D /Users/wangchunling/hawq-data-directory/masterdd -i -M master -p 5432 
> --silent-mode=true
>   501 55702 55701   0  5:38下午 ?? 0:00.01 postgres: port  5432, master 
> logger process
>   501 55705 55701   0  5:38下午 ?? 0:00.00 postgres: port  5432, stats 
> collector process
>   501 55706 55701   0  5:38下午 ?? 0:00.04 postgres: port  5432, writer 
> process
>   501 55707 55701   0  5:38下午 ?? 0:00.01 postgres: port  5432, 
> checkpoint process
>   501 55708 55701   0  5:38下午 ?? 0:00.00 postgres: port  5432, 
> seqserver process
>   501 55709 55701   0  5:38下午 ?? 0:00.01 postgres: port  5432, WAL 
> Send Server process
>   501 55710 55701   0  5:38下午 ?? 0:00.00 postgres: port  5432, DFS 
> Metadata Cache process
>   501 55711 55701   0  5:38下午 ?? 0:00.27 postgres: port  5432, master 
> resource manager
>   501 55727 1   0  5:38下午 ?? 0:00.52 /usr/local/hawq/bin/postgres 
> -D /Users/wangchunling/hawq-data-directory/segmentdd -i -M segment -p 4 
> --silent-mode=true
>   501 55728 55727   0  5:38下午 ?? 0:00.06 postgres: port 4, logger 
> process
>   501 55731 55727   0  5:38下午 ?? 0:00.00 postgres: port 4, stats 
> collector process
>   501 55732 55727   0  5:38下午 ?? 0:00.04 postgres: port 4, writer 
> process
>   501 55733 55727   0  5:38下午 ?? 0:00.01 postgres: port 4, 
> checkpoint process
>   501 55734 55727   0  5:38下午 ?? 0:00.09 postgres: port 40

[jira] [Commented] (HAWQ-564) QD hangs when connecting to resource manager

2016-03-21 Thread Chunling Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15203864#comment-15203864
 ] 

Chunling Wang commented on HAWQ-564:


There is another way to trigger this bug without fault injection.
1. First, run a query so that some QEs are started.
{code}
dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, 
test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
 count
---
  3725
(1 row)
{code}

{code}
$ ps -ef|grep postgres
  501 30190 1   0  2:34下午 ?? 0:00.31 /usr/local/hawq/bin/postgres 
-D /Users/wangchunling/hawq-data-directory/masterdd -i -M master -p 5432 
--silent-mode=true
  501 30191 30190   0  2:34下午 ?? 0:00.01 postgres: port  5432, master 
logger process
  501 30194 30190   0  2:34下午 ?? 0:00.00 postgres: port  5432, stats 
collector process
  501 30195 30190   0  2:34下午 ?? 0:00.01 postgres: port  5432, writer 
process
  501 30196 30190   0  2:34下午 ?? 0:00.00 postgres: port  5432, 
checkpoint process
  501 30197 30190   0  2:34下午 ?? 0:00.00 postgres: port  5432, 
seqserver process
  501 30198 30190   0  2:34下午 ?? 0:00.00 postgres: port  5432, WAL Send 
Server process
  501 30199 30190   0  2:34下午 ?? 0:00.00 postgres: port  5432, DFS 
Metadata Cache process
  501 30200 30190   0  2:34下午 ?? 0:00.07 postgres: port  5432, master 
resource manager
  501 30216 1   0  2:34下午 ?? 0:00.37 /usr/local/hawq/bin/postgres 
-D /Users/wangchunling/hawq-data-directory/segmentdd -i -M segment -p 4 
--silent-mode=true
  501 30217 30216   0  2:34下午 ?? 0:00.02 postgres: port 4, logger 
process
  501 30220 30216   0  2:34下午 ?? 0:00.00 postgres: port 4, stats 
collector process
  501 30221 30216   0  2:34下午 ?? 0:00.01 postgres: port 4, writer 
process
  501 30222 30216   0  2:34下午 ?? 0:00.00 postgres: port 4, 
checkpoint process
  501 30223 30216   0  2:34下午 ?? 0:00.03 postgres: port 4, segment 
resource manager
  501 30231 30190   0  2:35下午 ?? 0:00.03 postgres: port  5432, 
wangchunling dispatch [local] con12 cmd6 idle [local]
  501 30235 30216   0  2:35下午 ?? 0:00.13 postgres: port 4, 
wangchunling dispatch 127.0.0.1(65051) con12 seg0 idle
  501 30239 30216   0  2:35下午 ?? 0:00.06 postgres: port 4, 
wangchunling dispatch 127.0.0.1(65061) con12 seg0 idle
  501 30240 30216   0  2:35下午 ?? 0:00.06 postgres: port 4, 
wangchunling dispatch 127.0.0.1(65063) con12 seg0 idle
  501 30242 99560   0  2:36下午 ttys0000:00.00 grep postgres
{code}

2. Kill one of the QEs. Afterwards no QE processes remain.
{code}
$ kill -9 30235
$ ps -ef|grep postgres
  501 30190 1   0  2:34下午 ?? 0:00.32 /usr/local/hawq/bin/postgres 
-D /Users/wangchunling/hawq-data-directory/masterdd -i -M master -p 5432 
--silent-mode=true
  501 30191 30190   0  2:34下午 ?? 0:00.01 postgres: port  5432, master 
logger process
  501 30194 30190   0  2:34下午 ?? 0:00.00 postgres: port  5432, stats 
collector process
  501 30195 30190   0  2:34下午 ?? 0:00.01 postgres: port  5432, writer 
process
  501 30196 30190   0  2:34下午 ?? 0:00.00 postgres: port  5432, 
checkpoint process
  501 30197 30190   0  2:34下午 ?? 0:00.00 postgres: port  5432, 
seqserver process
  501 30198 30190   0  2:34下午 ?? 0:00.00 postgres: port  5432, WAL Send 
Server process
  501 30199 30190   0  2:34下午 ?? 0:00.00 postgres: port  5432, DFS 
Metadata Cache process
  501 30200 30190   0  2:34下午 ?? 0:00.08 postgres: port  5432, master 
resource manager
  501 30216 1   0  2:34下午 ?? 0:00.58 /usr/local/hawq/bin/postgres 
-D /Users/wangchunling/hawq-data-directory/segmentdd -i -M segment -p 4 
--silent-mode=true
  501 30217 30216   0  2:34下午 ?? 0:00.03 postgres: port 4, logger 
process
  501 30231 30190   0  2:35下午 ?? 0:00.04 postgres: port  5432, 
wangchunling dispatch [local] con12 cmd6 idle [local]
  501 30248 30216   0  2:36下午 ?? 0:00.00 postgres: port 4, stats 
collector process
  501 30249 30216   0  2:36下午 ?? 0:00.00 postgres: port 4, writer 
process
  501 30250 30216   0  2:36下午 ?? 0:00.00 postgres: port 4, 
checkpoint process
  501 30251 30216   0  2:36下午 ?? 0:00.00 postgres: port 4, segment 
resource manager
  501 30255 99560   0  2:36下午 ttys0000:00.00 grep postgres
{code}
3. Run the query again and get some new QEs.
{code}
dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, 
test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
 count
---
  3725
(1 row)
{code}

{code}
$ ps -ef|grep postgres
  501 30190 1   0  2:34下午 ?? 0:00.33 /usr/local/hawq/bin/postgres 
-D /Users/wangchunling/hawq-data-directory/masterdd -i -M master -p 5432 
--silent-mode=true
  501 30191 30190   0  2:34下午 ?? 0:00.01 

[jira] [Updated] (HAWQ-564) QD hangs when connecting to resource manager

2016-03-20 Thread Chunling Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chunling Wang updated HAWQ-564:
---
Description: 
When we first inject a panic into the QE process and run a query, the segment
goes down. After the segment is back up, we run another query and get the
correct answer. Then we inject the same panic a second time. After the segment
goes down and comes back up again, we run a query and find that the QD process
hangs when connecting to the resource manager. Here is the backtrace when the
QD hangs:
{code}
* thread #1: tid = 0x21d8be, 0x7fff890355be libsystem_kernel.dylib`poll + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10
    frame #1: 0x000101daeafe postgres`processAllCommFileDescs + 158 at rmcomm_AsyncComm.c:156
    frame #2: 0x000101db85f5 postgres`callSyncRPCRemote(hostname=0x7f9c19e00cd0, port=5437, sendbuff=0x7f9c1b918f50, sendbuffsize=80, sendmsgid=259, exprecvmsgid=2307, recvsmb=, errorbuf=0x00010230c1a0, errorbufsize=) + 645 at rmcomm_SyncComm.c:122
    frame #3: 0x000101db2d85 postgres`acquireResourceFromRM [inlined] callSyncRPCToRM(sendbuff=0x7f9c1b918f50, sendbuffsize=, sendmsgid=259, exprecvmsgid=2307, recvsmb=0x7f9c1b918e70, errorbuf=, errorbufsize=1024) + 73 at rmcomm_QD2RM.c:2780
    frame #4: 0x000101db2d3c postgres`acquireResourceFromRM(index=, sessionid=12, slice_size=462524016, iobytes=134217728, preferred_nodes=0x7f9c1a02d398, preferred_nodes_size=, max_seg_count_fix=, min_seg_count_fix=, errorbuf=, errorbufsize=) + 572 at rmcomm_QD2RM.c:742
    frame #5: 0x000101c979e7 postgres`AllocateResource(life=QRL_ONCE, slice_size=5, iobytes=134217728, max_target_segment_num=1, min_target_segment_num=1, vol_info=0x7f9c1a02d398, vol_info_size=1) + 631 at pquery.c:796
    frame #6: 0x000101e8c60f postgres`calculate_planner_segment_num(query=, resourceLife=QRL_ONCE, fullRangeTable=, intoPolicy=, sliceNum=5) + 14287 at cdbdatalocality.c:4207
    frame #7: 0x000101c0f671 postgres`planner + 106 at planner.c:496
    frame #8: 0x000101c0f607 postgres`planner(parse=0x7f9c1a02a140, cursorOptions=, boundParams=0x, resourceLife=QRL_ONCE) + 311 at planner.c:310
    frame #9: 0x000101c8eb33 postgres`pg_plan_query(querytree=0x7f9c1a02a140, boundParams=0x, resource_life=QRL_ONCE) + 99 at postgres.c:837
    frame #10: 0x000101c956ae postgres`exec_simple_query + 21 at postgres.c:911
    frame #11: 0x000101c95699 postgres`exec_simple_query(query_string=0x7f9c1a028a30, seqServerHost=0x, seqServerPort=-1) + 1577 at postgres.c:1671
    frame #12: 0x000101c91a4c postgres`PostgresMain(argc=, argv=, username=0x7f9c1b808cf0) + 9404 at postgres.c:4754
    frame #13: 0x000101c4ae02 postgres`ServerLoop [inlined] BackendRun + 105 at postmaster.c:5889
    frame #14: 0x000101c4ad99 postgres`ServerLoop at postmaster.c:5484
    frame #15: 0x000101c4ad99 postgres`ServerLoop + 9593 at postmaster.c:2163
    frame #16: 0x000101c47d3b postgres`PostmasterMain(argc=, argv=) + 5019 at postmaster.c:1454
    frame #17: 0x000101bb1aa9 postgres`main(argc=9, argv=0x7f9c19c1eef0) + 1433 at main.c:209
    frame #18: 0x7fff95e8c5c9 libdyld.dylib`start + 1

  thread #2: tid = 0x21d8bf, 0x7fff890355be libsystem_kernel.dylib`poll + 10
    frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10
    frame #1: 0x000101dfe723 postgres`rxThreadFunc(arg=) + 2163 at ic_udp.c:6251
    frame #2: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
    frame #3: 0x7fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
    frame #4: 0x7fff95e804b1 libsystem_pthread.dylib`thread_start + 13

  thread #3: tid = 0x21d9c2, 0x7fff890343f6 libsystem_kernel.dylib`__select + 10
    frame #0: 0x7fff890343f6 libsystem_kernel.dylib`__select + 10
    frame #1: 0x000101e9d42e postgres`pg_usleep(microsec=) + 78 at pgsleep.c:43
    frame #2: 0x000101db1a66 postgres`generateResourceRefreshHeartBeat(arg=0x7f9c19f02480) + 166 at rmcomm_QD2RM.c:1519
    frame #3: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
    frame #4: 0x7fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
    frame #5: 0x7fff95e804b1 libsystem_pthread.dylib`thread_start + 13
{code}
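
In the backtrace above, thread #1 is sitting in poll() underneath callSyncRPCRemote, waiting for the resource manager to answer. As a generic illustration of how a synchronous RPC wait can be bounded so that a missing reply becomes an error instead of a hang, here is a small Python sketch. It is not taken from HAWQ; wait_for_reply and its timeout parameter are hypothetical, and this report does not say whether the real loop already uses a timeout.

```python
import select
import socket


def wait_for_reply(sock, timeout_s):
    """Wait up to timeout_s seconds for a reply; raise instead of hanging."""
    poller = select.poll()
    poller.register(sock, select.POLLIN)
    # poll() takes its timeout in milliseconds.
    events = poller.poll(int(timeout_s * 1000))
    if not events:
        raise TimeoutError(f"no reply within {timeout_s} s")
    data = sock.recv(4096)
    if not data:
        # Readable with zero bytes means the peer closed the connection.
        raise ConnectionError("peer closed the connection before replying")
    return data


if __name__ == "__main__":
    client, server = socket.socketpair()   # stand-ins for the QD and RM ends
    try:
        wait_for_reply(client, 0.1)
    except TimeoutError as exc:
        print("timed out:", exc)
    server.sendall(b"reply")
    print(wait_for_reply(client, 0.1))
    client.close()
    server.close()
```

With a bound like this the caller can retry or report an error to the user, which is usually easier to diagnose than a backend stuck in poll().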

Here are the steps to reproduce:
1. Before injection, the query returns the correct answer.
{code}
dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, 
test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
 count
---
  3725
(1 row)
{code}
2. Inject the panic; the fault is triggered and the segment goes down.
{code}
dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, 
test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
ERROR:  fault triggered, fault name:'fail_qe_whe

[jira] [Updated] (HAWQ-564) QD hangs when connecting to resource manager

2016-03-20 Thread Chunling Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chunling Wang updated HAWQ-564:
---
Affects Version/s: 2.0.0

> QD hangs when connecting to resource manager
> 
>
> Key: HAWQ-564
> URL: https://issues.apache.org/jira/browse/HAWQ-564
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Resource Manager
>Affects Versions: 2.0.0
>Reporter: Chunling Wang
>Assignee: Lei Chang
>
> When we first inject a panic into the QE process and run a query, the segment
> goes down. After the segment is back up, we run another query and get the
> correct answer. Then we inject the same panic a second time. After the segment
> goes down and comes back up again, we run a query and find that the QD process
> hangs when connecting to the resource manager. Here is the backtrace when the
> QD hangs:
> {code}
> * thread #1: tid = 0x21d8be, 0x7fff890355be libsystem_kernel.dylib`poll + 
> 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
>   * frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10
> frame #1: 0x000101daeafe postgres`processAllCommFileDescs + 158 at 
> rmcomm_AsyncComm.c:156
> frame #2: 0x000101db85f5 
> postgres`callSyncRPCRemote(hostname=0x7f9c19e00cd0, port=5437, 
> sendbuff=0x7f9c1b918f50, sendbuffsize=80, sendmsgid=259, 
> exprecvmsgid=2307, recvsmb=, errorbuf=0x00010230c1a0, 
> errorbufsize=) + 645 at rmcomm_SyncComm.c:122
> frame #3: 0x000101db2d85 postgres`acquireResourceFromRM [inlined] 
> callSyncRPCToRM(sendbuff=0x7f9c1b918f50, sendbuffsize=, 
> sendmsgid=259, exprecvmsgid=2307, recvsmb=0x7f9c1b918e70, 
> errorbuf=, errorbufsize=1024) + 73 at rmcomm_QD2RM.c:2780
> frame #4: 0x000101db2d3c 
> postgres`acquireResourceFromRM(index=, sessionid=12, 
> slice_size=462524016, iobytes=134217728, preferred_nodes=0x7f9c1a02d398, 
> preferred_nodes_size=, max_seg_count_fix=, 
> min_seg_count_fix=, errorbuf=, 
> errorbufsize=) + 572 at rmcomm_QD2RM.c:742
> frame #5: 0x000101c979e7 postgres`AllocateResource(life=QRL_ONCE, 
> slice_size=5, iobytes=134217728, max_target_segment_num=1, 
> min_target_segment_num=1, vol_info=0x7f9c1a02d398, vol_info_size=1) + 631 
> at pquery.c:796
> frame #6: 0x000101e8c60f 
> postgres`calculate_planner_segment_num(query=, 
> resourceLife=QRL_ONCE, fullRangeTable=, 
> intoPolicy=, sliceNum=5) + 14287 at cdbdatalocality.c:4207
> frame #7: 0x000101c0f671 postgres`planner + 106 at planner.c:496
> frame #8: 0x000101c0f607 postgres`planner(parse=0x7f9c1a02a140, 
> cursorOptions=, boundParams=0x, 
> resourceLife=QRL_ONCE) + 311 at planner.c:310
> frame #9: 0x000101c8eb33 
> postgres`pg_plan_query(querytree=0x7f9c1a02a140, 
> boundParams=0x, resource_life=QRL_ONCE) + 99 at postgres.c:837
> frame #10: 0x000101c956ae postgres`exec_simple_query + 21 at 
> postgres.c:911
> frame #11: 0x000101c95699 
> postgres`exec_simple_query(query_string=0x7f9c1a028a30, 
> seqServerHost=0x, seqServerPort=-1) + 1577 at postgres.c:1671
> frame #12: 0x000101c91a4c postgres`PostgresMain(argc=, 
> argv=, username=0x7f9c1b808cf0) + 9404 at postgres.c:4754
> frame #13: 0x000101c4ae02 postgres`ServerLoop [inlined] BackendRun + 
> 105 at postmaster.c:5889
> frame #14: 0x000101c4ad99 postgres`ServerLoop at postmaster.c:5484
> frame #15: 0x000101c4ad99 postgres`ServerLoop + 9593 at 
> postmaster.c:2163
> frame #16: 0x000101c47d3b postgres`PostmasterMain(argc=, 
> argv=) + 5019 at postmaster.c:1454
> frame #17: 0x000101bb1aa9 postgres`main(argc=9, 
> argv=0x7f9c19c1eef0) + 1433 at main.c:209
> frame #18: 0x7fff95e8c5c9 libdyld.dylib`start + 1
>   thread #2: tid = 0x21d8bf, 0x7fff890355be libsystem_kernel.dylib`poll + 
> 10
> frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10
> frame #1: 0x000101dfe723 postgres`rxThreadFunc(arg=) + 
> 2163 at ic_udp.c:6251
> frame #2: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
> frame #3: 0x7fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
> frame #4: 0x7fff95e804b1 libsystem_pthread.dylib`thread_start + 13
>   thread #3: tid = 0x21d9c2, 0x7fff890343f6 
> libsystem_kernel.dylib`__select + 10
> frame #0: 0x7fff890343f6 libsystem_kernel.dylib`__select + 10
> frame #1: 0x000101e9d42e postgres`pg_usleep(microsec=) + 
> 78 at pgsleep.c:43
> frame #2: 0x000101db1a66 
> postgres`generateResourceRefreshHeartBeat(arg=0x7f9c19f02480) + 166 at 
> rmcomm_QD2RM.c:1519
> frame #3: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
> frame #4: 0x7fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
> frame #5: 0x7f

[jira] [Updated] (HAWQ-564) QD hangs when connecting to resource manager

2016-03-20 Thread Chunling Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chunling Wang updated HAWQ-564:
---
Description: 
We first inject a panic into the QE process and run a query; the segment goes 
down. After the segment is up, we run another query and get the correct answer. 
Then we inject the same panic a second time. After the segment goes down and 
comes up again, we run a query and find that the QD process hangs when 
connecting to the resource manager. Here is the backtrace when the QD hangs:
{code}
* thread #1: tid = 0x21d8be, 0x7fff890355be libsystem_kernel.dylib`poll + 
10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10
frame #1: 0x000101daeafe postgres`processAllCommFileDescs + 158 at 
rmcomm_AsyncComm.c:156
frame #2: 0x000101db85f5 
postgres`callSyncRPCRemote(hostname=0x7f9c19e00cd0, port=5437, 
sendbuff=0x7f9c1b918f50, sendbuffsize=80, sendmsgid=259, exprecvmsgid=2307, 
recvsmb=, errorbuf=0x00010230c1a0, errorbufsize=) 
+ 645 at rmcomm_SyncComm.c:122
frame #3: 0x000101db2d85 postgres`acquireResourceFromRM [inlined] 
callSyncRPCToRM(sendbuff=0x7f9c1b918f50, sendbuffsize=, 
sendmsgid=259, exprecvmsgid=2307, recvsmb=0x7f9c1b918e70, 
errorbuf=, errorbufsize=1024) + 73 at rmcomm_QD2RM.c:2780
frame #4: 0x000101db2d3c 
postgres`acquireResourceFromRM(index=, sessionid=12, 
slice_size=462524016, iobytes=134217728, preferred_nodes=0x7f9c1a02d398, 
preferred_nodes_size=, max_seg_count_fix=, 
min_seg_count_fix=, errorbuf=, 
errorbufsize=) + 572 at rmcomm_QD2RM.c:742
frame #5: 0x000101c979e7 postgres`AllocateResource(life=QRL_ONCE, 
slice_size=5, iobytes=134217728, max_target_segment_num=1, 
min_target_segment_num=1, vol_info=0x7f9c1a02d398, vol_info_size=1) + 631 
at pquery.c:796
frame #6: 0x000101e8c60f 
postgres`calculate_planner_segment_num(query=, 
resourceLife=QRL_ONCE, fullRangeTable=, intoPolicy=, 
sliceNum=5) + 14287 at cdbdatalocality.c:4207
frame #7: 0x000101c0f671 postgres`planner + 106 at planner.c:496
frame #8: 0x000101c0f607 postgres`planner(parse=0x7f9c1a02a140, 
cursorOptions=, boundParams=0x, 
resourceLife=QRL_ONCE) + 311 at planner.c:310
frame #9: 0x000101c8eb33 
postgres`pg_plan_query(querytree=0x7f9c1a02a140, 
boundParams=0x, resource_life=QRL_ONCE) + 99 at postgres.c:837
frame #10: 0x000101c956ae postgres`exec_simple_query + 21 at 
postgres.c:911
frame #11: 0x000101c95699 
postgres`exec_simple_query(query_string=0x7f9c1a028a30, 
seqServerHost=0x, seqServerPort=-1) + 1577 at postgres.c:1671
frame #12: 0x000101c91a4c postgres`PostgresMain(argc=, 
argv=, username=0x7f9c1b808cf0) + 9404 at postgres.c:4754
frame #13: 0x000101c4ae02 postgres`ServerLoop [inlined] BackendRun + 
105 at postmaster.c:5889
frame #14: 0x000101c4ad99 postgres`ServerLoop at postmaster.c:5484
frame #15: 0x000101c4ad99 postgres`ServerLoop + 9593 at 
postmaster.c:2163
frame #16: 0x000101c47d3b postgres`PostmasterMain(argc=, 
argv=) + 5019 at postmaster.c:1454
frame #17: 0x000101bb1aa9 postgres`main(argc=9, 
argv=0x7f9c19c1eef0) + 1433 at main.c:209
frame #18: 0x7fff95e8c5c9 libdyld.dylib`start + 1

  thread #2: tid = 0x21d8bf, 0x7fff890355be libsystem_kernel.dylib`poll + 10
frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10
frame #1: 0x000101dfe723 postgres`rxThreadFunc(arg=) + 
2163 at ic_udp.c:6251
frame #2: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
frame #3: 0x7fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
frame #4: 0x7fff95e804b1 libsystem_pthread.dylib`thread_start + 13

  thread #3: tid = 0x21d9c2, 0x7fff890343f6 libsystem_kernel.dylib`__select 
+ 10
frame #0: 0x7fff890343f6 libsystem_kernel.dylib`__select + 10
frame #1: 0x000101e9d42e postgres`pg_usleep(microsec=) + 
78 at pgsleep.c:43
frame #2: 0x000101db1a66 
postgres`generateResourceRefreshHeartBeat(arg=0x7f9c19f02480) + 166 at 
rmcomm_QD2RM.c:1519
frame #3: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
frame #4: 0x7fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
frame #5: 0x7fff95e804b1 libsystem_pthread.dylib`thread_start + 13
{code}

And here are the operations:
1. Before injection, get query answer correctly.
{code}
dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, 
test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
 count
---
  3725
(1 row)
{code}
2. Inject panic, fault triggered, and segment is down.
{code}
dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, 
test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
ERROR:  fault triggered, fault name:'fail_qe_whe

[jira] [Updated] (HAWQ-564) QD hangs when connecting to resource manager

2016-03-20 Thread Chunling Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chunling Wang updated HAWQ-564:
---
Description: 
We first inject a panic into the QE process and run a query; the segment goes 
down. After the segment is up, we run another query and get the correct answer. 
Then we inject the same panic a second time. After the segment goes down and 
comes up again, we run a query and find that the QD process hangs when 
connecting to the resource manager. Here is the backtrace when the QD hangs:
{code}
* thread #1: tid = 0x21d8be, 0x7fff890355be libsystem_kernel.dylib`poll + 
10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10
frame #1: 0x000101daeafe postgres`processAllCommFileDescs + 158 at 
rmcomm_AsyncComm.c:156
frame #2: 0x000101db85f5 
postgres`callSyncRPCRemote(hostname=0x7f9c19e00cd0, port=5437, 
sendbuff=0x7f9c1b918f50, sendbuffsize=80, sendmsgid=259, exprecvmsgid=2307, 
recvsmb=, errorbuf=0x00010230c1a0, errorbufsize=) 
+ 645 at rmcomm_SyncComm.c:122
frame #3: 0x000101db2d85 postgres`acquireResourceFromRM [inlined] 
callSyncRPCToRM(sendbuff=0x7f9c1b918f50, sendbuffsize=, 
sendmsgid=259, exprecvmsgid=2307, recvsmb=0x7f9c1b918e70, 
errorbuf=, errorbufsize=1024) + 73 at rmcomm_QD2RM.c:2780
frame #4: 0x000101db2d3c 
postgres`acquireResourceFromRM(index=, sessionid=12, 
slice_size=462524016, iobytes=134217728, preferred_nodes=0x7f9c1a02d398, 
preferred_nodes_size=, max_seg_count_fix=, 
min_seg_count_fix=, errorbuf=, 
errorbufsize=) + 572 at rmcomm_QD2RM.c:742
frame #5: 0x000101c979e7 postgres`AllocateResource(life=QRL_ONCE, 
slice_size=5, iobytes=134217728, max_target_segment_num=1, 
min_target_segment_num=1, vol_info=0x7f9c1a02d398, vol_info_size=1) + 631 
at pquery.c:796
frame #6: 0x000101e8c60f 
postgres`calculate_planner_segment_num(query=, 
resourceLife=QRL_ONCE, fullRangeTable=, intoPolicy=, 
sliceNum=5) + 14287 at cdbdatalocality.c:4207
frame #7: 0x000101c0f671 postgres`planner + 106 at planner.c:496
frame #8: 0x000101c0f607 postgres`planner(parse=0x7f9c1a02a140, 
cursorOptions=, boundParams=0x, 
resourceLife=QRL_ONCE) + 311 at planner.c:310
frame #9: 0x000101c8eb33 
postgres`pg_plan_query(querytree=0x7f9c1a02a140, 
boundParams=0x, resource_life=QRL_ONCE) + 99 at postgres.c:837
frame #10: 0x000101c956ae postgres`exec_simple_query + 21 at 
postgres.c:911
frame #11: 0x000101c95699 
postgres`exec_simple_query(query_string=0x7f9c1a028a30, 
seqServerHost=0x, seqServerPort=-1) + 1577 at postgres.c:1671
frame #12: 0x000101c91a4c postgres`PostgresMain(argc=, 
argv=, username=0x7f9c1b808cf0) + 9404 at postgres.c:4754
frame #13: 0x000101c4ae02 postgres`ServerLoop [inlined] BackendRun + 
105 at postmaster.c:5889
frame #14: 0x000101c4ad99 postgres`ServerLoop at postmaster.c:5484
frame #15: 0x000101c4ad99 postgres`ServerLoop + 9593 at 
postmaster.c:2163
frame #16: 0x000101c47d3b postgres`PostmasterMain(argc=, 
argv=) + 5019 at postmaster.c:1454
frame #17: 0x000101bb1aa9 postgres`main(argc=9, 
argv=0x7f9c19c1eef0) + 1433 at main.c:209
frame #18: 0x7fff95e8c5c9 libdyld.dylib`start + 1

  thread #2: tid = 0x21d8bf, 0x7fff890355be libsystem_kernel.dylib`poll + 10
frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10
frame #1: 0x000101dfe723 postgres`rxThreadFunc(arg=) + 
2163 at ic_udp.c:6251
frame #2: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
frame #3: 0x7fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
frame #4: 0x7fff95e804b1 libsystem_pthread.dylib`thread_start + 13

  thread #3: tid = 0x21d9c2, 0x7fff890343f6 libsystem_kernel.dylib`__select 
+ 10
frame #0: 0x7fff890343f6 libsystem_kernel.dylib`__select + 10
frame #1: 0x000101e9d42e postgres`pg_usleep(microsec=) + 
78 at pgsleep.c:43
frame #2: 0x000101db1a66 
postgres`generateResourceRefreshHeartBeat(arg=0x7f9c19f02480) + 166 at 
rmcomm_QD2RM.c:1519
frame #3: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
frame #4: 0x7fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
frame #5: 0x7fff95e804b1 libsystem_pthread.dylib`thread_start + 13
{code}

And here are the operations:
{code}
dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, 
test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
 count
---
  3725
(1 row)

dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, 
test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
ERROR:  fault triggered, fault name:'fail_qe_when_do_query' fault type:'panic' 
(faultinjector.c:656)  (seg0 localhost:4 pid=26936)
dispatch=# select count(*) fr

[jira] [Created] (HAWQ-564) QD hangs when connecting to resource manager

2016-03-20 Thread Chunling Wang (JIRA)
Chunling Wang created HAWQ-564:
--

 Summary: QD hangs when connecting to resource manager
 Key: HAWQ-564
 URL: https://issues.apache.org/jira/browse/HAWQ-564
 Project: Apache HAWQ
  Issue Type: Bug
  Components: Resource Manager
Reporter: Chunling Wang
Assignee: Lei Chang


We first inject a panic into the QE process and run a query; the segment goes 
down. After the segment is up, we run another query and get the correct answer. 
Then we inject the same panic a second time. After the segment goes down and 
comes up again, we run a query and find that the QD process hangs when 
connecting to the resource manager. Here is the backtrace when the QD hangs:
{code}
* thread #1: tid = 0x21d8be, 0x7fff890355be libsystem_kernel.dylib`poll + 
10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10
frame #1: 0x000101daeafe postgres`processAllCommFileDescs + 158 at 
rmcomm_AsyncComm.c:156
frame #2: 0x000101db85f5 
postgres`callSyncRPCRemote(hostname=0x7f9c19e00cd0, port=5437, 
sendbuff=0x7f9c1b918f50, sendbuffsize=80, sendmsgid=259, exprecvmsgid=2307, 
recvsmb=, errorbuf=0x00010230c1a0, errorbufsize=) 
+ 645 at rmcomm_SyncComm.c:122
frame #3: 0x000101db2d85 postgres`acquireResourceFromRM [inlined] 
callSyncRPCToRM(sendbuff=0x7f9c1b918f50, sendbuffsize=, 
sendmsgid=259, exprecvmsgid=2307, recvsmb=0x7f9c1b918e70, 
errorbuf=, errorbufsize=1024) + 73 at rmcomm_QD2RM.c:2780
frame #4: 0x000101db2d3c 
postgres`acquireResourceFromRM(index=, sessionid=12, 
slice_size=462524016, iobytes=134217728, preferred_nodes=0x7f9c1a02d398, 
preferred_nodes_size=, max_seg_count_fix=, 
min_seg_count_fix=, errorbuf=, 
errorbufsize=) + 572 at rmcomm_QD2RM.c:742
frame #5: 0x000101c979e7 postgres`AllocateResource(life=QRL_ONCE, 
slice_size=5, iobytes=134217728, max_target_segment_num=1, 
min_target_segment_num=1, vol_info=0x7f9c1a02d398, vol_info_size=1) + 631 
at pquery.c:796
frame #6: 0x000101e8c60f 
postgres`calculate_planner_segment_num(query=, 
resourceLife=QRL_ONCE, fullRangeTable=, intoPolicy=, 
sliceNum=5) + 14287 at cdbdatalocality.c:4207
frame #7: 0x000101c0f671 postgres`planner + 106 at planner.c:496
frame #8: 0x000101c0f607 postgres`planner(parse=0x7f9c1a02a140, 
cursorOptions=, boundParams=0x, 
resourceLife=QRL_ONCE) + 311 at planner.c:310
frame #9: 0x000101c8eb33 
postgres`pg_plan_query(querytree=0x7f9c1a02a140, 
boundParams=0x, resource_life=QRL_ONCE) + 99 at postgres.c:837
frame #10: 0x000101c956ae postgres`exec_simple_query + 21 at 
postgres.c:911
frame #11: 0x000101c95699 
postgres`exec_simple_query(query_string=0x7f9c1a028a30, 
seqServerHost=0x, seqServerPort=-1) + 1577 at postgres.c:1671
frame #12: 0x000101c91a4c postgres`PostgresMain(argc=, 
argv=, username=0x7f9c1b808cf0) + 9404 at postgres.c:4754
frame #13: 0x000101c4ae02 postgres`ServerLoop [inlined] BackendRun + 
105 at postmaster.c:5889
frame #14: 0x000101c4ad99 postgres`ServerLoop at postmaster.c:5484
frame #15: 0x000101c4ad99 postgres`ServerLoop + 9593 at 
postmaster.c:2163
frame #16: 0x000101c47d3b postgres`PostmasterMain(argc=, 
argv=) + 5019 at postmaster.c:1454
frame #17: 0x000101bb1aa9 postgres`main(argc=9, 
argv=0x7f9c19c1eef0) + 1433 at main.c:209
frame #18: 0x7fff95e8c5c9 libdyld.dylib`start + 1

  thread #2: tid = 0x21d8bf, 0x7fff890355be libsystem_kernel.dylib`poll + 10
frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10
frame #1: 0x000101dfe723 postgres`rxThreadFunc(arg=) + 
2163 at ic_udp.c:6251
frame #2: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
frame #3: 0x7fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
frame #4: 0x7fff95e804b1 libsystem_pthread.dylib`thread_start + 13

  thread #3: tid = 0x21d9c2, 0x7fff890343f6 libsystem_kernel.dylib`__select 
+ 10
frame #0: 0x7fff890343f6 libsystem_kernel.dylib`__select + 10
frame #1: 0x000101e9d42e postgres`pg_usleep(microsec=) + 
78 at pgsleep.c:43
frame #2: 0x000101db1a66 
postgres`generateResourceRefreshHeartBeat(arg=0x7f9c19f02480) + 166 at 
rmcomm_QD2RM.c:1519
frame #3: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
frame #4: 0x7fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
frame #5: 0x7fff95e804b1 libsystem_pthread.dylib`thread_start + 13
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HAWQ-559) QD hangs when QE is killed after connected to QD

2016-03-19 Thread Chunling Wang (JIRA)
Chunling Wang created HAWQ-559:
--

 Summary: QD hangs when QE is killed after connected to QD
 Key: HAWQ-559
 URL: https://issues.apache.org/jira/browse/HAWQ-559
 Project: Apache HAWQ
  Issue Type: Bug
  Components: Dispatcher
Reporter: Chunling Wang
Assignee: Lei Chang


When the first query finishes, the QE is still alive. Then we run the second 
query. After the QD thread is created and bound to the QE, but before it sends 
data to the QE, we kill this QE and find that the QD hangs.
Here is the backtrace when the QD hangs:
* thread #1: tid = 0x1c4afd, 0x7fff890355be libsystem_kernel.dylib`poll + 
10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10
frame #1: 0x00010745692c postgres`receiveChunksUDP [inlined] 
udpSignalPoll + 42 at ic_udp.c:2882
frame #2: 0x000107456902 postgres`receiveChunksUDP + 26 at ic_udp.c:2715
frame #3: 0x0001074568e8 postgres`receiveChunksUDP [inlined] 
waitOnCondition(timeout_us=25) + 82 at ic_udp.c:1599
frame #4: 0x000107456896 
postgres`receiveChunksUDP(pTransportStates=0x7ff2a381ae48, 
pEntry=0x7ff2a18f2230, motNodeID=, 
srcRoute=0x7fff58c0ce96, conn=, inTeardown='\0') + 726 at 
ic_udp.c:4039
frame #5: 0x000107452a86 postgres`RecvTupleChunkFromAnyUDP [inlined] 
RecvTupleChunkFromAnyUDP_Internal + 498 at ic_udp.c:4146
frame #6: 0x000107452894 
postgres`RecvTupleChunkFromAnyUDP(mlStates=, 
transportStates=, motNodeID=1, srcRoute=0x7fff58c0ce96) + 100 
at ic_udp.c:4167
frame #7: 0x000107442254 postgres`RecvTupleFrom [inlined] 
processIncomingChunks(mlStates=0x7ff2a3812a30, 
transportStates=0x7ff2a381ae48, motNodeID=1, srcRoute=) + 34 
at cdbmotion.c:684
frame #8: 0x000107442232 
postgres`RecvTupleFrom(mlStates=0x7ff2a3812a30, 
transportStates=, motNodeID=1, tup_i=0x7fff58c0cf00, 
srcRoute=-100) + 370 at cdbmotion.c:610
frame #9: 0x0001071c8778 postgres`ExecMotion [inlined] 
execMotionUnsortedReceiver(node=) + 57 at nodeMotion.c:466
frame #10: 0x0001071c873f postgres`ExecMotion(node=) + 
1071 at nodeMotion.c:298
frame #11: 0x0001071a4835 
postgres`ExecProcNode(node=0x7ff2a38164b8) + 613 at execProcnode.c:999
frame #12: 0x0001071b9f82 postgres`ExecAgg + 104 at nodeAgg.c:1163
frame #13: 0x0001071b9f1a postgres`ExecAgg + 316 at nodeAgg.c:1693
frame #14: 0x0001071b9dde postgres`ExecAgg(node=0x7ff2a3815348) + 
126 at nodeAgg.c:1138
frame #15: 0x0001071a4803 
postgres`ExecProcNode(node=0x7ff2a3815348) + 563 at execProcnode.c:979
frame #16: 0x00010719ecfd 
postgres`ExecutePlan(estate=0x7ff2a3814e30, planstate=0x7ff2a3815348, 
operation=CMD_SELECT, numberTuples=0, direction=, 
dest=0x7ff2a28db178) + 1181 at execMain.c:3218
frame #17: 0x00010719e619 
postgres`ExecutorRun(queryDesc=0x7ff2a3811f00, 
direction=ForwardScanDirection, count=0) + 569 at execMain.c:1213
frame #18: 0x0001072e7fc2 postgres`PortalRun + 14 at pquery.c:1649
frame #19: 0x0001072e7fb4 postgres`PortalRun(portal=0x7ff2a1893e30, 
count=, isTopLevel='\x01', dest=, 
altdest=0x7ff2a28db178, completionTag=0x7fff58c0d530) + 1124 at 
pquery.c:1471
frame #20: 0x0001072e4a8e 
postgres`exec_simple_query(query_string=0x7ff2a380fe30, 
seqServerHost=0x, seqServerPort=-1) + 2078 at postgres.c:1745
frame #21: 0x0001072e0c4c postgres`PostgresMain(argc=, 
argv=, username=0x7ff2a201bcf0) + 9404 at postgres.c:4754
frame #22: 0x00010729a002 postgres`ServerLoop [inlined] BackendRun + 
105 at postmaster.c:5889
frame #23: 0x000107299f99 postgres`ServerLoop at postmaster.c:5484
frame #24: 0x000107299f99 postgres`ServerLoop + 9593 at 
postmaster.c:2163
frame #25: 0x000107296f3b postgres`PostmasterMain(argc=, 
argv=) + 5019 at postmaster.c:1454
frame #26: 0x000107200ca9 postgres`main(argc=9, 
argv=0x7ff2a141eef0) + 1433 at main.c:209
frame #27: 0x7fff95e8c5c9 libdyld.dylib`start + 1

  thread #2: tid = 0x1c4afe, 0x7fff890355be libsystem_kernel.dylib`poll + 10
frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10
frame #1: 0x00010744d8e3 postgres`rxThreadFunc(arg=) + 
2163 at ic_udp.c:6251
frame #2: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
frame #3: 0x7fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
frame #4: 0x7fff95e804b1 libsystem_pthread.dylib`thread_start + 13

  thread #3: tid = 0x1c4b02, 0x7fff890343f6 libsystem_kernel.dylib`__select 
+ 10
frame #0: 0x7fff890343f6 libsystem_kernel.dylib`__select + 10
frame #1: 0x0001074ec47e postgres`pg_usleep(microsec=) + 
78 at pgsleep.c:43
frame #2: 0x000107400c26 
postgres`generateResourceRefreshHeartBeat(ar

[jira] [Updated] (HAWQ-559) QD hangs when QE is killed after connected to QD

2016-03-18 Thread Chunling Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chunling Wang updated HAWQ-559:
---
Affects Version/s: 2.0.0
  Environment: mac os X 10.10

> QD hangs when QE is killed after connected to QD
> 
>
> Key: HAWQ-559
> URL: https://issues.apache.org/jira/browse/HAWQ-559
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Dispatcher
>Affects Versions: 2.0.0
> Environment: mac os X 10.10
>Reporter: Chunling Wang
>Assignee: Lei Chang
>
> When the first query finishes, the QE is still alive. Then we run the second 
> query. After the QD thread is created and bound to the QE, but before it 
> sends data to the QE, we kill this QE and find that the QD hangs.
> Here is the backtrace when the QD hangs:
> * thread #1: tid = 0x1c4afd, 0x7fff890355be libsystem_kernel.dylib`poll + 
> 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
>   * frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10
> frame #1: 0x00010745692c postgres`receiveChunksUDP [inlined] 
> udpSignalPoll + 42 at ic_udp.c:2882
> frame #2: 0x000107456902 postgres`receiveChunksUDP + 26 at 
> ic_udp.c:2715
> frame #3: 0x0001074568e8 postgres`receiveChunksUDP [inlined] 
> waitOnCondition(timeout_us=25) + 82 at ic_udp.c:1599
> frame #4: 0x000107456896 
> postgres`receiveChunksUDP(pTransportStates=0x7ff2a381ae48, 
> pEntry=0x7ff2a18f2230, motNodeID=, 
> srcRoute=0x7fff58c0ce96, conn=, inTeardown='\0') + 726 at 
> ic_udp.c:4039
> frame #5: 0x000107452a86 postgres`RecvTupleChunkFromAnyUDP [inlined] 
> RecvTupleChunkFromAnyUDP_Internal + 498 at ic_udp.c:4146
> frame #6: 0x000107452894 
> postgres`RecvTupleChunkFromAnyUDP(mlStates=, 
> transportStates=, motNodeID=1, srcRoute=0x7fff58c0ce96) + 
> 100 at ic_udp.c:4167
> frame #7: 0x000107442254 postgres`RecvTupleFrom [inlined] 
> processIncomingChunks(mlStates=0x7ff2a3812a30, 
> transportStates=0x7ff2a381ae48, motNodeID=1, srcRoute=) + 34 
> at cdbmotion.c:684
> frame #8: 0x000107442232 
> postgres`RecvTupleFrom(mlStates=0x7ff2a3812a30, 
> transportStates=, motNodeID=1, tup_i=0x7fff58c0cf00, 
> srcRoute=-100) + 370 at cdbmotion.c:610
> frame #9: 0x0001071c8778 postgres`ExecMotion [inlined] 
> execMotionUnsortedReceiver(node=) + 57 at nodeMotion.c:466
> frame #10: 0x0001071c873f postgres`ExecMotion(node=) + 
> 1071 at nodeMotion.c:298
> frame #11: 0x0001071a4835 
> postgres`ExecProcNode(node=0x7ff2a38164b8) + 613 at execProcnode.c:999
> frame #12: 0x0001071b9f82 postgres`ExecAgg + 104 at nodeAgg.c:1163
> frame #13: 0x0001071b9f1a postgres`ExecAgg + 316 at nodeAgg.c:1693
> frame #14: 0x0001071b9dde postgres`ExecAgg(node=0x7ff2a3815348) + 
> 126 at nodeAgg.c:1138
> frame #15: 0x0001071a4803 
> postgres`ExecProcNode(node=0x7ff2a3815348) + 563 at execProcnode.c:979
> frame #16: 0x00010719ecfd 
> postgres`ExecutePlan(estate=0x7ff2a3814e30, planstate=0x7ff2a3815348, 
> operation=CMD_SELECT, numberTuples=0, direction=, 
> dest=0x7ff2a28db178) + 1181 at execMain.c:3218
> frame #17: 0x00010719e619 
> postgres`ExecutorRun(queryDesc=0x7ff2a3811f00, 
> direction=ForwardScanDirection, count=0) + 569 at execMain.c:1213
> frame #18: 0x0001072e7fc2 postgres`PortalRun + 14 at pquery.c:1649
> frame #19: 0x0001072e7fb4 
> postgres`PortalRun(portal=0x7ff2a1893e30, count=, 
> isTopLevel='\x01', dest=, altdest=0x7ff2a28db178, 
> completionTag=0x7fff58c0d530) + 1124 at pquery.c:1471
> frame #20: 0x0001072e4a8e 
> postgres`exec_simple_query(query_string=0x7ff2a380fe30, 
> seqServerHost=0x, seqServerPort=-1) + 2078 at postgres.c:1745
> frame #21: 0x0001072e0c4c postgres`PostgresMain(argc=, 
> argv=, username=0x7ff2a201bcf0) + 9404 at postgres.c:4754
> frame #22: 0x00010729a002 postgres`ServerLoop [inlined] BackendRun + 
> 105 at postmaster.c:5889
> frame #23: 0x000107299f99 postgres`ServerLoop at postmaster.c:5484
> frame #24: 0x000107299f99 postgres`ServerLoop + 9593 at 
> postmaster.c:2163
> frame #25: 0x000107296f3b postgres`PostmasterMain(argc=, 
> argv=) + 5019 at postmaster.c:1454
> frame #26: 0x000107200ca9 postgres`main(argc=9, 
> argv=0x7ff2a141eef0) + 1433 at main.c:209
> frame #27: 0x7fff95e8c5c9 libdyld.dylib`start + 1
>   thread #2: tid = 0x1c4afe, 0x7fff890355be libsystem_kernel.dylib`poll + 
> 10
> frame #0: 0x7fff890355be libsystem_kernel.dylib`poll + 10
> frame #1: 0x00010744d8e3 postgres`rxThreadFunc(arg=) + 
> 2163 at ic_udp.c:6251
> frame #2: 0x7fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
> frame #3: 0x000

[jira] [Updated] (HAWQ-523) Dead code in executormgr_bind_executor_task()

2016-03-15 Thread Chunling Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chunling Wang updated HAWQ-523:
---
Summary: Dead code in executormgr_bind_executor_task()  (was: dead code in 
executormgr_bind_executor_task())

> Dead code in executormgr_bind_executor_task()
> -
>
> Key: HAWQ-523
> URL: https://issues.apache.org/jira/browse/HAWQ-523
> Project: Apache HAWQ
>  Issue Type: New Feature
>  Components: Dispatcher
>Affects Versions: 2.0.0
>Reporter: Chunling Wang
>Assignee: Lei Chang
>
> In executormgr.c, the 'desc == NULL' branch below is never reached:
> bool
> executormgr_bind_executor_task(struct DispatchData *data,
>                                QueryExecutor *executor,
>                                SegmentDatabaseDescriptor *desc,
>                                struct DispatchTask *task,
>                                struct DispatchSlice *slice)
> {
>   ...
>   if (desc == NULL)
>   {
>     executor->health = QEH_ERROR;
>     return false;
>   }
>   ...
> }





[jira] [Updated] (HAWQ-524) Do not resolve the condition of 'executor->refResult = NULL' in executormgr_bind_executor_task()

2016-03-15 Thread Chunling Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chunling Wang updated HAWQ-524:
---
Summary: Do not resolve the condition of 'executor->refResult = NULL' in 
executormgr_bind_executor_task()   (was: do not resolve the condition of 
'executor->refResult = NULL' in executormgr_bind_executor_task() )

> Do not resolve the condition of 'executor->refResult = NULL' in 
> executormgr_bind_executor_task() 
> -
>
> Key: HAWQ-524
> URL: https://issues.apache.org/jira/browse/HAWQ-524
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Dispatcher
>Affects Versions: 2.0.0
>Reporter: Chunling Wang
>Assignee: Lili Ma
> Fix For: 2.0.0
>
>
> In executormgr.c, the check below should not be an Assert(). The condition 
> 'executor->refResult == NULL' should be caught and handled.
> bool
> executormgr_bind_executor_task(struct DispatchData *data,
>                                QueryExecutor *executor,
>                                SegmentDatabaseDescriptor *desc,
>                                struct DispatchTask *task,
>                                struct DispatchSlice *slice)
> {
>   ...
>   Assert(executor->refResult != NULL);
>   ...
> }
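
To illustrate the difference the report is pointing at, here is a minimal sketch that contrasts an assertion with an explicit runtime check. It assumes simplified stand-in types: QueryExecutor is reduced to two fields, QEH_OK is a made-up value, and the standard assert() stands in for the Assert() macro. The error handling in bind_with_check() is modeled on the 'desc == NULL' branch quoted in HAWQ-523 and is only one possible way to handle the condition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the HAWQ types; only the fields used here. */
typedef enum { QEH_OK, QEH_ERROR } ExecutorHealth;

typedef struct QueryExecutor
{
    ExecutorHealth health;
    void          *refResult;
} QueryExecutor;

/*
 * Assertion style: the check exists only in assert-enabled builds.
 * When assertions are compiled out, a NULL refResult passes through.
 */
bool bind_with_assert(QueryExecutor *executor)
{
    assert(executor->refResult != NULL);
    return true;
}

/*
 * Runtime-check style: the condition is handled in every build by
 * marking the executor as failed and reporting failure to the caller.
 */
bool bind_with_check(QueryExecutor *executor)
{
    if (executor->refResult == NULL)
    {
        executor->health = QEH_ERROR;
        return false;
    }
    return true;
}
```

With the runtime check, the caller can see the failure through the return value and the health field regardless of how the binary was built.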





[jira] [Updated] (HAWQ-539) Improve code coverage for dispatcher: connection_fail_after_gang_creation& create_cdb_dispath_result_object& dispmgt_concurrent_connect

2016-03-15 Thread Chunling Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chunling Wang updated HAWQ-539:
---
Summary: Improve code coverage for dispatcher: 
connection_fail_after_gang_creation& create_cdb_dispath_result_object& 
dispmgt_concurrent_connect  (was: Add fault injection for dispatcher: 
connection_fail_after_gang_creation& create_cdb_dispath_result_object& 
dispmgt_concurrent_connect)

> Improve code coverage for dispatcher: connection_fail_after_gang_creation& 
> create_cdb_dispath_result_object& dispmgt_concurrent_connect
> ---
>
> Key: HAWQ-539
> URL: https://issues.apache.org/jira/browse/HAWQ-539
> Project: Apache HAWQ
>  Issue Type: Sub-task
>  Components: Dispatcher
>Reporter: Chunling Wang
>Assignee: Lei Chang
>
> Add the three fault injections below:
> 1. connection_fail_after_gang_creation
> In function dispatcher_bind_executor() of dispatcher.c, inject a fault 
> before the connection is rebound.
> #ifdef FAULT_INJECTOR
>   FaultInjector_InjectFaultIfSet(
>       ConnectionFailAfterGangCreation,
>       DDLNotSpecified,
>       "",  // databaseName
>       ""); // tableName
> #endif
> 2. create_cdb_dispath_result_object
> In function cdbdisp_makeResult() of cdbdispatchresult.c, inject an 
> out-of-memory fault before calling PQExpBufferBroken().
> #ifdef FAULT_INJECTOR
>   FaultInjector_InjectFaultIfSet(
>       CreateCdbDispathResultObject,
>       DDLNotSpecified,
>       "",  // databaseName
>       ""); // tableName
> #endif
> 3. worker_manager_submit_job
> Inject an error in function workermgr_submit_job() of workermgr.c.
> #ifdef FAULT_INJECTOR
>   FaultInjector_InjectFaultIfSet(
>       WorkerManagerSubmitJob,
>       DDLNotSpecified,
>       "",  // databaseName
>       ""); // tableName
> #endif





[jira] [Created] (HAWQ-539) Add fault injection for dispatcher: connection_fail_after_gang_creation& create_cdb_dispath_result_object& dispmgt_concurrent_connect

2016-03-14 Thread Chunling Wang (JIRA)
Chunling Wang created HAWQ-539:
--

 Summary: Add fault injection for dispatcher: 
connection_fail_after_gang_creation& create_cdb_dispath_result_object& 
dispmgt_concurrent_connect
 Key: HAWQ-539
 URL: https://issues.apache.org/jira/browse/HAWQ-539
 Project: Apache HAWQ
  Issue Type: Sub-task
  Components: Dispatcher
Reporter: Chunling Wang
Assignee: Lei Chang


Add the three fault injections below:

1. connection_fail_after_gang_creation
In function dispatcher_bind_executor() of dispatcher.c, inject a fault before 
the connection is rebound.
#ifdef FAULT_INJECTOR
    FaultInjector_InjectFaultIfSet(
        ConnectionFailAfterGangCreation,
        DDLNotSpecified,
        "",  // databaseName
        ""); // tableName
#endif

2. create_cdb_dispath_result_object
In function cdbdisp_makeResult() of cdbdispatchresult.c, inject an 
out-of-memory fault before calling PQExpBufferBroken().
#ifdef FAULT_INJECTOR
    FaultInjector_InjectFaultIfSet(
        CreateCdbDispathResultObject,
        DDLNotSpecified,
        "",  // databaseName
        ""); // tableName
#endif

3. worker_manager_submit_job
Inject an error in function workermgr_submit_job() of workermgr.c.
#ifdef FAULT_INJECTOR
    FaultInjector_InjectFaultIfSet(
        WorkerManagerSubmitJob,
        DDLNotSpecified,
        "",  // databaseName
        ""); // tableName
#endif
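
For readers unfamiliar with the pattern, here is a minimal standalone sketch of how a named fault point can be armed by a test and consumed at a call site. It is not the HAWQ fault injector: fault_set_sketch(), fault_inject_if_set_sketch(), submit_job_sketch() and the FaultId enum are hypothetical names, the one-shot behaviour is an assumption, and the real FaultInjector_InjectFaultIfSet() takes the four arguments shown above.

```c
#include <assert.h>
#include <stdbool.h>

/* Normally a build flag; defined here only so the sketch compiles standalone. */
#define FAULT_INJECTOR 1

/* Hypothetical fault identifiers mirroring the three names in this issue. */
typedef enum
{
    ConnectionFailAfterGangCreation,
    CreateCdbDispathResultObject,
    WorkerManagerSubmitJob,
    FaultIdCount
} FaultId;

/* One flag per fault point; all start disarmed. */
static bool fault_armed[FaultIdCount];

/* Arms or disarms one fault point (a test would call this). */
void
fault_set_sketch(FaultId id, bool armed)
{
    assert(id < FaultIdCount);
    fault_armed[id] = armed;
}

/* Returns true once per arming, then disarms the fault point (assumed policy). */
bool
fault_inject_if_set_sketch(FaultId id)
{
    assert(id < FaultIdCount);
    if (!fault_armed[id])
        return false;
    fault_armed[id] = false;
    return true;
}

/* Example call site: the injected failure takes the same path as a real one. */
bool
submit_job_sketch(void)
{
#ifdef FAULT_INJECTOR
    if (fault_inject_if_set_sketch(WorkerManagerSubmitJob))
        return false;   /* simulated submission error */
#endif
    return true;
}
```

A test would arm WorkerManagerSubmitJob, call submit_job_sketch() and check that the caller's error handling runs. Because the call site sits inside #ifdef FAULT_INJECTOR, a build without that flag carries none of this code.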





[jira] [Created] (HAWQ-538) Add fault injection for dispatcher

2016-03-14 Thread Chunling Wang (JIRA)
Chunling Wang created HAWQ-538:
--

 Summary: Add fault injection for dispatcher
 Key: HAWQ-538
 URL: https://issues.apache.org/jira/browse/HAWQ-538
 Project: Apache HAWQ
  Issue Type: New Feature
  Components: Dispatcher
Reporter: Chunling Wang
Assignee: Lei Chang








[jira] [Commented] (HAWQ-524) do not resolve the condition of 'executor->refResult = NULL' in executormgr_bind_executor_task()

2016-03-13 Thread Chunling Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HAWQ-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192821#comment-15192821
 ] 

Chunling Wang commented on HAWQ-524:


In cdbdispatchresult.c, when dispatchResult->resultbuf == NULL, there is no 
need to free the PGresult objects again in function cdbdisp_resetResult(). 
Guard the free loop as shown below:
void
cdbdisp_resetResult(CdbDispatchResult  *dispatchResult)
{
    if (dispatchResult->resultbuf != NULL)
    {
        PQExpBuffer buf = dispatchResult->resultbuf;
        PGresult  **begp = (PGresult **)buf->data;
        PGresult  **endp = (PGresult **)(buf->data + buf->len);
        PGresult  **p;

        /* Free the PGresult objects. */
        for (p = begp; p < endp; ++p)
        {
            Assert(*p != NULL);
            PQclear(*p);
        }
    }
    ...
}
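
The same guard can be tried outside the HAWQ tree with the sketch below. Everything in it is a stand-in: FakeResult and FakeBuffer are hypothetical replacements for PGresult and PQExpBuffer, free() stands in for PQclear(), and reset_results_sketch() is not a HAWQ function. It also resets len so a second call is a no-op, which is an assumption and not something stated in this comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for libpq's PGresult. */
typedef struct FakeResult { int unused; } FakeResult;

/* Hypothetical stand-in for PQExpBuffer: a byte buffer holding result pointers. */
typedef struct FakeBuffer
{
    char  *data;  /* packed array of FakeResult pointers */
    size_t len;   /* number of bytes used in data */
} FakeBuffer;

/*
 * Frees every result stored in buf and returns how many were freed.
 * A NULL buffer means there is nothing to free, so it returns 0.
 */
size_t
reset_results_sketch(FakeBuffer *buf)
{
    size_t freed = 0;

    if (buf == NULL)
        return 0;

    FakeResult **begp = (FakeResult **) buf->data;
    FakeResult **endp = (FakeResult **) (buf->data + buf->len);

    for (FakeResult **p = begp; p < endp; ++p)
    {
        assert(*p != NULL);
        free(*p);           /* stands in for PQclear(*p) */
        ++freed;
    }

    buf->len = 0;           /* assumed: mark the buffer empty after freeing */
    return freed;
}
```

Calling reset_results_sketch(NULL) returns 0 without touching memory, which is the case the guard above is meant to cover.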

> do not resolve the condition of 'executor->refResult = NULL' in 
> executormgr_bind_executor_task() 
> -
>
> Key: HAWQ-524
> URL: https://issues.apache.org/jira/browse/HAWQ-524
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Dispatcher
>Affects Versions: 2.0.0
>Reporter: Chunling Wang
>Assignee: Lei Chang
>
> In executormgr.c, the check below should not be an Assert(). The condition 
> 'executor->refResult == NULL' should be caught and handled.
> bool
> executormgr_bind_executor_task(struct DispatchData *data,
>                                QueryExecutor *executor,
>                                SegmentDatabaseDescriptor *desc,
>                                struct DispatchTask *task,
>                                struct DispatchSlice *slice)
> {
>   ...
>   Assert(executor->refResult != NULL);
>   ...
> }





[jira] [Updated] (HAWQ-524) do not resolve the condition of 'executor->refResult = NULL' in executormgr_bind_executor_task()

2016-03-13 Thread Chunling Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chunling Wang updated HAWQ-524:
---
Affects Version/s: 2.0.0

> do not resolve the condition of 'executor->refResult = NULL' in 
> executormgr_bind_executor_task() 
> -
>
> Key: HAWQ-524
> URL: https://issues.apache.org/jira/browse/HAWQ-524
> Project: Apache HAWQ
>  Issue Type: Bug
>  Components: Dispatcher
>Affects Versions: 2.0.0
>Reporter: Chunling Wang
>Assignee: Lei Chang
>
> In executormgr.c, the check below should not be an Assert(). The condition 
> 'executor->refResult == NULL' should be caught and handled.
> bool
> executormgr_bind_executor_task(struct DispatchData *data,
>                                QueryExecutor *executor,
>                                SegmentDatabaseDescriptor *desc,
>                                struct DispatchTask *task,
>                                struct DispatchSlice *slice)
> {
>   ...
>   Assert(executor->refResult != NULL);
>   ...
> }





[jira] [Created] (HAWQ-524) do not resolve the condition of 'executor->refResult = NULL' in executormgr_bind_executor_task()

2016-03-13 Thread Chunling Wang (JIRA)
Chunling Wang created HAWQ-524:
--

 Summary: do not resolve the condition of 'executor->refResult = 
NULL' in executormgr_bind_executor_task() 
 Key: HAWQ-524
 URL: https://issues.apache.org/jira/browse/HAWQ-524
 Project: Apache HAWQ
  Issue Type: Bug
  Components: Dispatcher
Reporter: Chunling Wang
Assignee: Lei Chang


In executormgr.c, the check below should not be an Assert(). The condition 
'executor->refResult == NULL' should be caught and handled.
bool
executormgr_bind_executor_task(struct DispatchData *data,
                               QueryExecutor *executor,
                               SegmentDatabaseDescriptor *desc,
                               struct DispatchTask *task,
                               struct DispatchSlice *slice)
{
    ...
    Assert(executor->refResult != NULL);
    ...
}
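
One possible shape for the requested change is sketched below, as an illustration only and not the committed fix. QueryExecutor is reduced to a hypothetical two-field struct, bind_executor_task_sketch() is an invented name, and marking the executor QEH_ERROR and returning false is an assumption borrowed from the 'desc == NULL' branch quoted in HAWQ-523. The point of the sketch is that an assertion is normally checked only in assert-enabled builds, while an explicit test also protects production builds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the HAWQ types; the real definitions differ. */
typedef enum { QEH_OK, QEH_ERROR } ExecutorHealth;

typedef struct QueryExecutor
{
    void          *refResult;   /* result object bound to this executor */
    ExecutorHealth health;
} QueryExecutor;

/*
 * Bind step with the NULL result handled at run time.
 * Returns false and marks the executor unhealthy when refResult is NULL.
 */
bool
bind_executor_task_sketch(QueryExecutor *executor)
{
    /* A NULL executor would be a programming error, so an assert still fits. */
    assert(executor != NULL);

    /* A missing result can happen at run time, so test it explicitly. */
    if (executor->refResult == NULL)
    {
        executor->health = QEH_ERROR;
        return false;
    }

    executor->health = QEH_OK;
    return true;
}
```

The caller can then treat a false return like any other bind failure.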






[jira] [Updated] (HAWQ-523) dead code in executormgr_bind_executor_task()

2016-03-13 Thread Chunling Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HAWQ-523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chunling Wang updated HAWQ-523:
---
Affects Version/s: 2.0.0

> dead code in executormgr_bind_executor_task()
> -
>
> Key: HAWQ-523
> URL: https://issues.apache.org/jira/browse/HAWQ-523
> Project: Apache HAWQ
>  Issue Type: New Feature
>  Components: Dispatcher
>Affects Versions: 2.0.0
>Reporter: Chunling Wang
>Assignee: Lei Chang
>
> In executormgr.c, the code below is never reached:
> bool
> executormgr_bind_executor_task(struct DispatchData *data,
>                                QueryExecutor *executor,
>                                SegmentDatabaseDescriptor *desc,
>                                struct DispatchTask *task,
>                                struct DispatchSlice *slice)
> {
>   ...
>   if (desc == NULL)
>   {
>     executor->health = QEH_ERROR;
>     return false;
>   }
>   ...
> }





[jira] [Created] (HAWQ-523) dead code in executormgr_bind_executor_task()

2016-03-13 Thread Chunling Wang (JIRA)
Chunling Wang created HAWQ-523:
--

 Summary: dead code in executormgr_bind_executor_task()
 Key: HAWQ-523
 URL: https://issues.apache.org/jira/browse/HAWQ-523
 Project: Apache HAWQ
  Issue Type: New Feature
  Components: Dispatcher
Reporter: Chunling Wang
Assignee: Lei Chang


In executormgr.c, the code below is never reached:

bool
executormgr_bind_executor_task(struct DispatchData *data,
                               QueryExecutor *executor,
                               SegmentDatabaseDescriptor *desc,
                               struct DispatchTask *task,
                               struct DispatchSlice *slice)
{
    ...
    if (desc == NULL)
    {
        executor->health = QEH_ERROR;
        return false;
    }
    ...
}




